We present Avi (Action from Volumetric Inference), a novel 3D Vision-Language-Action (VLA) architecture that reframes robotic control as a problem of volumetric reasoning rather than low-level policy generation. By leveraging ShapeLLM-Omni as a 3D Multi-Modal Language Model and extending it with location quantization, we enable the model to interpret natural language instructions and predict goal-conditioned 3D representations of the environment. These predicted volumes are then aligned through geometric optimization, yielding interpretable and morphology-agnostic actions.
Our approach combines stereo reconstruction, 2D segmentation (via Segment Anything), and a fine-tuned 3D Vision-Language Model (based on Qwen-VL and 3D VQVAE embeddings) to predict goal-conditioned 3D volumes. We further align these volumes using classical geometric optimization (ICP) to produce interpretable, spatially grounded actions.
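As a minimal sketch of this perception front end, the snippet below lifts a SAM-segmented object into a 3D point cloud from a rectified stereo pair. The file paths, checkpoint name, reprojection matrix, and prompt point are placeholders rather than Avi's actual configuration.

```python
# Minimal sketch: lift a SAM-segmented object into a 3D point cloud from a stereo pair.
# Assumes rectified left/right images and a reprojection matrix Q from stereo calibration;
# paths, checkpoint, and the prompt point are illustrative placeholders.
import cv2
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

left = cv2.imread("left.png")          # rectified left image (BGR)
right = cv2.imread("right.png")        # rectified right image (BGR)
Q = np.load("Q.npy")                   # 4x4 disparity-to-depth matrix from stereo calibration

# 1) Stereo reconstruction: disparity -> dense 3D points in the left camera frame.
matcher = cv2.StereoSGBM_create(minDisparity=0, numDisparities=128, blockSize=5)
disp = matcher.compute(cv2.cvtColor(left, cv2.COLOR_BGR2GRAY),
                       cv2.cvtColor(right, cv2.COLOR_BGR2GRAY)).astype(np.float32) / 16.0
points_3d = cv2.reprojectImageTo3D(disp, Q)           # (H, W, 3)

# 2) 2D segmentation with Segment Anything, prompted by a single click on the object.
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")
predictor = SamPredictor(sam)
predictor.set_image(cv2.cvtColor(left, cv2.COLOR_BGR2RGB))
masks, scores, _ = predictor.predict(point_coords=np.array([[320, 240]]),
                                     point_labels=np.array([1]))
mask = masks[np.argmax(scores)]                        # best mask, shape (H, W)

# 3) Keep only valid, masked pixels: this is the scene/object cloud fed to the 3D VLM.
valid = mask & (disp > 0)
object_cloud = points_3d[valid]                        # (N, 3) array
```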
Our architecture extends the foundational 3D model, ShapeLLM-Omni, which is pretrained on large-scale 3D assets and capable of handling multi-modal inputs, including text, images, and 3D point clouds. We integrate Qwen-2.5 (7B), a state-of-the-art large vision-language model, with ShapeLLM-Omni to inherit both powerful linguistic grounding and native 3D spatial reasoning.
We extend the vocabulary by introducing dedicated position and scale tokens. Specifically, we define three independent position axes: X, Y, Z ∈ {1,2,...,256}, each discretized into 256 bins, and discretize object scale into S ∈ {1,2,...,128}, yielding 128 scale tokens. This introduces a total of 896 additional tokens (3 × 256 + 128) for spatial context.
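A minimal sketch of this vocabulary extension, assuming a HuggingFace-style tokenizer and the public Qwen-2.5 checkpoint as a stand-in for the ShapeLLM-Omni backbone; the token string format (e.g. "<X_17>") is our own illustrative choice, not the paper's.

```python
# Minimal sketch of the spatial vocabulary extension.
# The checkpoint name and token strings are placeholders for illustration only.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

# 3 position axes x 256 bins + 128 scale bins = 896 new tokens.
position_tokens = [f"<{axis}_{i}>" for axis in ("X", "Y", "Z") for i in range(1, 257)]
scale_tokens = [f"<S_{i}>" for i in range(1, 129)]
new_tokens = position_tokens + scale_tokens
assert len(new_tokens) == 896

tokenizer.add_tokens(new_tokens, special_tokens=True)
model.resize_token_embeddings(len(tokenizer))  # grow the embedding matrix for the new IDs
```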
Given a prompt and the current scene point cloud P_t, we generate a next point cloud prediction P̂_{t+1} such that P̂_{t+1} ≈ P_t + ΔP, where ΔP represents the learned spatial change conditioned on the prompt. We then compute the Iterative Closest Point (ICP) transformation that minimizes the alignment error between the current and predicted clouds.
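A minimal sketch of this alignment step, assuming Open3D's point-to-point ICP as a stand-in for the paper's geometric optimization; the correspondence threshold is an arbitrary placeholder.

```python
# Minimal sketch: recover the rigid transform between P_t and the predicted P̂_{t+1}.
# Both clouds are (N, 3) NumPy arrays; max_corr_dist is an illustrative value in meters.
import numpy as np
import open3d as o3d

def icp_transform(current_pts: np.ndarray, predicted_pts: np.ndarray,
                  max_corr_dist: float = 0.02) -> np.ndarray:
    """Return the 4x4 rigid transform that best aligns the current cloud to the prediction."""
    source = o3d.geometry.PointCloud(o3d.utility.Vector3dVector(current_pts))
    target = o3d.geometry.PointCloud(o3d.utility.Vector3dVector(predicted_pts))
    result = o3d.pipelines.registration.registration_icp(
        source, target, max_corr_dist, np.eye(4),
        o3d.pipelines.registration.TransformationEstimationPointToPoint())
    return result.transformation  # 4x4 matrix encoding the predicted spatial change
```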
The system processes both geometric and linguistic inputs by mapping them into a shared latent space Z. Let P denote the input point cloud and T the human-provided text prompt. We define modality-specific encoders such that: f_3D(P) ∈ Z, f_text(T) ∈ Z, where f_3D and f_text are encoders for the 3D point cloud and text, respectively. This joint embedding ensures that both geometry and language are represented in a unified feature space, enabling cross-modal reasoning.
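The small PyTorch sketch below illustrates the shared latent space Z with toy stand-in encoders; in Avi, f_3D and f_text are realized by ShapeLLM-Omni's 3D VQVAE tokens and Qwen-2.5's language embeddings, not by these MLPs, and the latent size and vocabulary size are placeholders.

```python
# Minimal sketch of the joint embedding: both modalities land in the same latent space Z.
import torch
import torch.nn as nn

LATENT_DIM = 1024  # illustrative size of Z

class PointCloudEncoder(nn.Module):        # f_3D: (B, N, 3) -> (B, LATENT_DIM)
    def __init__(self):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, 256), nn.ReLU(), nn.Linear(256, LATENT_DIM))
    def forward(self, pts):
        return self.mlp(pts).max(dim=1).values    # permutation-invariant max pooling

class TextEncoder(nn.Module):              # f_text: (B, T) token IDs -> (B, LATENT_DIM)
    def __init__(self, vocab_size=152064):        # placeholder vocabulary size
        super().__init__()
        self.embed = nn.Embedding(vocab_size, LATENT_DIM)
    def forward(self, token_ids):
        return self.embed(token_ids).mean(dim=1)  # mean-pooled token embeddings

f_3d, f_text = PointCloudEncoder(), TextEncoder()
z_geom = f_3d(torch.randn(1, 2048, 3))               # f_3D(P) ∈ Z
z_lang = f_text(torch.randint(0, 152064, (1, 12)))   # f_text(T) ∈ Z
```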
For the language backbone, Qwen-2.5 (7B) provides strong multi-turn instruction following, chain-of-thought reasoning, and multilingual capabilities, making it particularly well suited for language-conditioned robotics.
We fine-tune the foundational model on robotics training data using the LIBERO Dataset, which provides diverse task demonstrations within the Robosuite environment. Our experiments focus on the drawer-closing task, demonstrating that Avi produces semantically consistent and physically realizable manipulations from only a small number of demonstrations (50 demos).
Our experiments on the drawer-closing task show successful execution across eighteen inference steps. The model demonstrates the ability to generate semantically consistent and physically realizable action trajectories conditioned on natural language instructions like "Close the drawer with the robot."
Through qualitative ablation studies, we demonstrate that location quantization is critical for precise manipulation. In tasks requiring fine-grained control, such as pick-and-place or insertion, the model must accurately predict gripper positions for correct end-effector motions.
Unlike prior Vision-Language-Action (VLA) methods that directly predict robot-specific action tokens, our approach emphasizes a morphology-agnostic policy: rather than outputting actions, our model predicts transformed 3D point clouds from which robot-specific trajectories can be computed via inverse kinematics, as sketched after the comparison table below.
| Method | Input Modality | Core Mechanism | Generates 3D Point Clouds? | No Action Tokens Needed? |
|---|---|---|---|---|
| This Work (Avi) | 3D point clouds + language | 3D MLLM predicting delta point clouds + IK | ✓ Yes | ✓ Yes |
| Robot4DGen | RGB-D video | 4D video generation with multi-view constraint | ✗ No (video only) | ✗ No |
| Unified Video-Action (UVA) | RGB video | Joint video–action latent modeling | ✗ No | ✗ No |
| DP3 | 3D point clouds | Diffusion model over actions conditioned on 3D | ✗ No (conditions on 3D, outputs actions) | ✗ No |
| FP3 | 3D point clouds + language | Diffusion transformer policy pre-trained on 3D | ✗ No (outputs actions directly) | ✗ No |
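A minimal sketch of the geometry-to-action step referenced above, assuming the current end-effector pose is available as a 4×4 homogeneous matrix; the IK solver itself is robot-specific and not shown.

```python
# Minimal sketch: turn the ICP-recovered rigid transform into an end-effector goal pose
# that a robot-specific inverse-kinematics solver could consume (solver not shown).
import numpy as np
from scipy.spatial.transform import Rotation

def end_effector_goal(icp_T: np.ndarray, ee_pose: np.ndarray):
    """icp_T, ee_pose: 4x4 homogeneous matrices. Returns (position, quaternion xyzw)."""
    goal = icp_T @ ee_pose                      # compose the predicted spatial change with the pose
    position = goal[:3, 3]
    quaternion = Rotation.from_matrix(goal[:3, :3]).as_quat()
    return position, quaternion                 # hand off to any morphology-specific IK solver
```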
This shift from language-to-action to language-to-geometry enables richer spatial grounding, efficient sim-to-real transfer, and more robust reasoning in multi-object robotic environments.
In future work, we plan to extend Avi to multi-task and multi-robot settings, evaluate it under real-world deployment using stereo-based 3D reconstruction pipelines, and integrate reinforcement learning to further refine long-horizon planning capabilities.
@inproceedings{song2025avi,
title={Avi: A 3D Vision-Language Action Model Architecture generating Action from Volumetric Inference},
author={Harris Song and Long Le},
booktitle={NeurIPS 2025 Workshop on Embodied World Models for Decision Making},
year={2025},
url={https://openreview.net/forum?id=3UB24EwYWV}
}