SOLAMI: Pioneering Social Vision-Language-Action for Immersive 3D Character Interaction


The dream of interacting naturally with a 3D autonomous character through both speech and body language is at the heart of the next frontier in human-computer interaction. Traditional AI agents are primarily limited to text or voice, missing the rich, nuanced communication that defines human social interaction. SOLAMI (Social vision-Language-Action Modeling for Immersive interaction) represents a groundbreaking step toward bridging this gap. This end-to-end framework enables 3D characters to perceive, understand, and respond to users' multimodal inputs—speech and motion—in real-time, creating a truly immersive social experience.

Core Innovations of the SOLAMI Framework

SOLAMI is built upon three fundamental pillars that collectively address the significant challenges in creating socially intelligent 3D characters.

A Unified Social VLA Architecture

At its core, SOLAMI utilizes a novel Vision-Language-Action (VLA) model architecture. Unlike modular "LLM-Agent" approaches that chain together separate subsystems for motion captioning, speech recognition, and text-to-motion generation, SOLAMI processes everything within a single, unified model. This end-to-end design is crucial for preserving the subtle, high-frequency nuances of social interaction that are often lost when translating multimodal data into text intermediaries.

The architecture is built upon a decoder-only large language model (LLM) backbone. User inputs—both speech audio and body motion—are converted into discrete tokens via specialized tokenizers. The LLM then predicts the character's responsive speech and motion tokens based on this input and the character's predefined profile. These tokens are finally decoded back into audible speech and animated motion.
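
To make this flow concrete, here is a minimal sketch of one inference turn in Python. All class and method names (the tokenizers' encode/decode interface, the generate call, the profile object) are illustrative assumptions, not the released API.

```python
# Minimal sketch of a SOLAMI-style inference turn (names are illustrative, not the released API).
def respond(llm, speech_tokenizer, motion_tokenizer, character_profile,
            user_audio, user_motion, history_tokens):
    # 1. Discretize the user's multimodal input into tokens.
    speech_tokens = speech_tokenizer.encode(user_audio)   # semantic speech tokens
    motion_tokens = motion_tokenizer.encode(user_motion)  # body/hand motion tokens

    # 2. Assemble the prompt: character profile, dialogue history, current user turn.
    prompt = character_profile.tokens + history_tokens + speech_tokens + motion_tokens

    # 3. The decoder-only LLM autoregressively predicts the character's response tokens.
    response = llm.generate(prompt)

    # 4. Decode the predicted tokens back into audible speech and animated motion.
    reply_audio = speech_tokenizer.decode(response.speech_tokens,
                                          voice=character_profile.voice_sample)
    reply_motion = motion_tokenizer.decode(response.motion_tokens)
    return reply_audio, reply_motion
```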

This approach allows the model to learn intricate behavior patterns that span both motion and speech modalities simultaneously, resulting in more coherent and contextually appropriate responses. Crucially, it also achieves significantly lower latency compared to multi-stage pipelines, a necessity for natural, real-time conversation.

The SynMSI Dataset: Solving the Data Scarcity Problem

A major impediment to developing social VLA models is the extreme scarcity of high-quality, multimodal interaction data. Capturing real-world dyadic interactions with synchronized motion, speech, and context is prohibitively expensive and complex.

To overcome this, the researchers developed an innovative synthetic data generation pipeline that creates the SynMSI (Synthetic Multimodal Social Interaction) dataset. The pipeline leverages existing, large-scale text-motion datasets. It starts from a collection of more than 5,300 character-relevant and everyday topics, uses advanced LLMs to generate multi-turn textual dialogue scripts, and then retrieves the most contextually appropriate motion sequences from a meticulously curated database of 46,000 motions to pair with the dialogue.

A key innovation is a refinement step that ensures the generated speech text is well-coordinated with the retrieved motions, mitigating a common misalignment issue in synthetic data. The final speech audio is synthesized using text-to-speech (TTS) and voice cloning techniques to maintain consistency with a specific character's voice. The result is a high-fidelity dataset of 6,300 multi-turn multimodal conversation items, all created automatically from existing resources.
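
The generation pipeline can be summarized in a short sketch. The function and field names below are placeholders chosen for illustration; the actual implementation details may differ.

```python
# Sketch of a SynMSI-style synthesis step for one conversation item (placeholder names).
def synthesize_conversation(topic, character, llm, motion_db, tts):
    # 1. Generate a multi-turn dialogue script conditioned on the topic and character profile.
    script = llm.generate_dialogue(topic=topic, profile=character.profile)

    item = []
    for turn in script:
        # 2. Retrieve the most contextually appropriate motion clip for this utterance.
        motion = motion_db.retrieve(turn.text)

        # 3. Refine the speech text so it stays coordinated with the retrieved motion.
        refined_text = llm.refine(turn.text, motion_caption=motion.caption)

        # 4. Synthesize character-consistent audio via TTS with voice cloning.
        audio = tts.synthesize(refined_text, reference_voice=character.voice_sample)

        item.append({"speaker": turn.speaker, "text": refined_text,
                     "motion": motion.sequence, "audio": audio})
    return item
```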

An Immersive VR Interface for Real-World Evaluation

Theory and metrics are one thing; human perception is another. To validate SOLAMI's performance in a realistic setting, the team developed a fully functional Virtual Reality interface. Built for the Oculus Quest 3, this system allows users to step into a virtual space and interact with characters driven by different AI models (including SOLAMI and various baselines) in a controlled A/B testing environment.

The VR headset captures the user's speech and full-body motion. This data is sent to a backend server equipped with powerful GPUs, which runs the AI model to generate the character's responses—returning speech audio, body motion parameters, and facial animation. The system retargets these outputs onto detailed 3D character models, completing the immersive feedback loop. This setup was instrumental in conducting a rigorous user study to measure the perceived quality of interactions.
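
A rough sketch of the server-side request handler illustrates this loop. The payload fields and component names are assumptions for illustration, not the project's actual interface.

```python
# Sketch of the GPU server's handling of one VR interaction request (assumed field names).
def handle_turn(request, solami_model, retargeter):
    user_audio = request["speech_audio"]   # user's recorded speech from the headset
    user_motion = request["body_motion"]   # user's captured full-body motion (e.g., SMPL-X poses)

    # Run the end-to-end model to produce the character's multimodal response.
    reply_audio, reply_motion, reply_face = solami_model.respond(user_audio, user_motion)

    # Retarget the generated motion onto the selected character's rig before returning it.
    character_anim = retargeter.retarget(reply_motion, target_rig=request["character_id"])

    return {
        "speech_audio": reply_audio,        # played back as the character's voice
        "motion": character_anim,           # drives the character's body animation
        "facial_animation": reply_face,     # drives lip sync and expression
    }
```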

How SOLAMI Outperforms Existing Methods

Rigorous quantitative experiments and user studies demonstrate SOLAMI's superiority over previous approaches like text-only LLM agents (LLM+Speech) or modular frameworks (e.g., DLP).

Quantitative Results: On objective metrics, the fully trained SOLAMI model outperformed both the LLM+Speech agent and the modular DLP framework in the quality and contextual appropriateness of its generated motion and speech, while also responding with substantially lower inference latency.

User Study Results: In immersive VR tests, users rated their interactions with SOLAMI-driven characters higher across all evaluated dimensions than interactions with the baseline systems.

These results validate the core hypothesis: an end-to-end social VLA model, trained on high-quality synthetic multimodal data, is far more effective at modeling the complex, intertwined nature of social behavior than systems that process modalities separately.


The Technical Breakdown: How SOLAMI Works

Architecture and Tokenization

SOLAMI treats speech and motion as new languages, adding them to the LLM's existing vocabulary.

  1. Motion Representation: Instead of using 3D joint positions, SOLAMI uses SMPL-X joint rotations. This representation is directly compatible with industry-standard animation workflows, avoiding the visual artifacts and time-consuming fitting processes required when using keypoints.
  2. Motion Tokenizer: Separate Vector Quantized VAEs (VQ-VAEs) discretize the body motion, hand motion, and the relative transformation between characters into tokens. This separation allows for higher reconstruction accuracy.
  3. Speech Tokenizer: SOLAMI uses SpeechTokenizer, which disentangles semantic content from acoustic details. This allows the model to process only the semantic tokens, drastically reducing computational cost while enabling instance-wise voice cloning during decoding.
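
As a rough illustration of what "adding new languages to the vocabulary" looks like in practice, the sketch below extends a Hugging Face-style causal LM with discrete motion and speech symbols. The backbone name and codebook sizes are placeholders, not the paper's actual configuration.

```python
# Minimal sketch: register discrete motion/speech codes as new vocabulary entries.
from transformers import AutoModelForCausalLM, AutoTokenizer

backbone = "meta-llama/Llama-2-7b-hf"  # placeholder backbone; SOLAMI's choice may differ
tokenizer = AutoTokenizer.from_pretrained(backbone)
model = AutoModelForCausalLM.from_pretrained(backbone)

# Hypothetical codebook sizes, for illustration only.
motion_vocab = [f"<motion_{i}>" for i in range(512)]   # VQ-VAE codebook entries (body/hand/transform)
speech_vocab = [f"<speech_{i}>" for i in range(1024)]  # semantic speech codebook entries

tokenizer.add_tokens(motion_vocab + speech_vocab)
model.resize_token_embeddings(len(tokenizer))  # new embedding rows are learned during pre-training
```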

The Three-Stage Training Strategy

Training SOLAMI is a complex process executed in three distinct stages:

  1. Tokenizer Training: The motion VQ-VAEs are trained to accurately encode and decode motion sequences into a discrete token space. The pre-trained SpeechTokenizer is frozen at this stage.
  2. Multi-task Pre-training for Modality Alignment: This critical stage aligns the motion and speech modalities with the text modality. The model is trained on large datasets of motion-text pairs and speech-text pairs for tasks like text-to-motion generation, motion captioning, text-to-speech, and speech recognition. This provides the model with a foundational understanding of how language relates to movement and sound.
  3. Instruction Tuning for Multi-turn Conversation: Finally, the model is fine-tuned on the SynMSI dataset. This teaches it the specific skill of engaging in extended, context-aware, multimodal dialogues, generating appropriate speech and motion responses based on user input and character identity.
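
The staged schedule can be summarized as a simple configuration outline. The component names and data labels below are illustrative, not the paper's exact training recipe.

```python
# Illustrative outline of the three-stage training strategy (placeholder names and values).
TRAINING_STAGES = [
    {
        "stage": 1,
        "name": "tokenizer_training",
        "trainable": ["motion_vqvaes"],        # body / hand / relative-transform tokenizers
        "frozen": ["speech_tokenizer"],        # pre-trained SpeechTokenizer stays fixed
        "objective": "discrete reconstruction of motion sequences",
    },
    {
        "stage": 2,
        "name": "multitask_pretraining",
        "trainable": ["llm_backbone"],
        "data": ["motion-text pairs", "speech-text pairs"],
        "tasks": ["text-to-motion", "motion captioning", "text-to-speech", "speech recognition"],
    },
    {
        "stage": 3,
        "name": "instruction_tuning",
        "trainable": ["llm_backbone"],
        "data": ["SynMSI multi-turn dialogues"],
        "objective": "generate character speech and motion tokens from user input and profile",
    },
]
```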

Frequently Asked Questions

What is a Social VLA model?
A Social Vision-Language-Action (VLA) model is an AI system designed to process and generate multiple modalities—specifically vision (or motion), language, and action—within a social interaction context. Unlike standard VLAs built for robotics, a Social VLA focuses on understanding and producing the nuanced behaviors of human-like conversation, including body language, tone of voice, and speech content.

Why is an end-to-end architecture better than a modular system for this task?
Modular systems process speech, text, and motion in separate stages, using text as an intermediary. This process loses subtle information (e.g., the exact cadence of a gesture or the emotional tone of a sigh) and introduces significant latency as data passes through each module. An end-to-end model processes all modalities simultaneously within a single system, preserving these nuances and enabling faster, more natural responses.

How was the training data for SOLAMI created?
The SynMSI dataset was created synthetically using an automated pipeline. It starts by generating diverse dialogue scripts using a powerful LLM guided by thousands of curated topics. It then pairs these dialogues with the most appropriate motion sequences retrieved from a large database of existing motion-capture data. Finally, it uses text-to-speech and voice cloning to generate character-consistent speech audio, ensuring the entire dataset is aligned and high-quality.

What are the practical applications of this technology?
The applications are broad: interactive NPCs and companions in games, immersive VR and AR experiences, embodied virtual assistants, and training or educational simulations in which a character must respond naturally to a user's speech and body language.

Can users interact with any 3D character using this system?
Yes, the system is designed to be character-agnostic. As long as a 3D model is rigged to a standard skeleton (like SMPL-X), SOLAMI can drive it. The model's ability to clone a specific voice from a short sample also allows it to embody a wide range of characters with unique vocal identities.

What are the current limitations and future directions for SOLAMI?
Current limitations include the inability to interact with objects and environments, the challenge of cross-embodiment (adapting motions faithfully to characters with very different body types), and difficulty sustaining very long-term conversations. Future work may incorporate video input, collect real human-character interaction data, and develop more efficient learning methods to handle the long-tail distribution of human motions.


Conclusion: The Future of Human-Character Interaction

SOLAMI represents a significant leap forward from simple chatbots to embodied, socially intelligent characters. By integrating a novel end-to-end architecture, a scalable data synthesis solution, and a robust evaluation platform, it provides a comprehensive framework for building the next generation of interactive AI. The results demonstrate that unifying perception, reasoning, and action within a single model is not just feasible but essential for achieving the low latency and high fidelity required for immersive social experiences. This work lays a strong foundation for a future where interacting with an AI character feels as natural and rich as interacting with another person.