We present Avi (Action from Volumetric Inference), a novel 3D Vision-Language-Action (VLA) architecture that reframes robotic control as a problem of volumetric reasoning rather than low-level policy generation. By leveraging ShapeLLM-Omni as a 3D Multi-Modal Language Model and extending it with location quantization, we enable the model to interpret natural language instructions and predict goal-conditioned 3D representations of the environment. These predicted volumes are then aligned through geometric optimization, yielding interpretable and morphology-agnostic actions.
Our approach combines stereo reconstruction, 2D segmentation (via Segment Anything), and a fine-tuned 3D Vision-Language Model (based on Qwen-VL and 3D VQVAE embeddings) to predict goal-conditioned 3D volumes. We further align these volumes using classical geometric optimization (ICP) to produce interpretable, spatially grounded actions.
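As a minimal sketch of this perception front end, the snippet below lifts a SAM-segmented object into a 3D point cloud from a rectified stereo pair. The file paths, checkpoint name, reprojection matrix, and prompt point are placeholders rather than Avi's actual configuration.

```python
# Minimal sketch: lift a SAM-segmented object into a 3D point cloud from a stereo pair.
# Assumes rectified left/right images and a reprojection matrix Q from stereo calibration;
# paths, checkpoint, and the prompt point are illustrative placeholders.
import cv2
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

left = cv2.imread("left.png")          # rectified left image (BGR)
right = cv2.imread("right.png")        # rectified right image (BGR)
Q = np.load("Q.npy")                   # 4x4 disparity-to-depth matrix from stereo calibration

# 1) Stereo reconstruction: disparity -> dense 3D points in the left camera frame.
matcher = cv2.StereoSGBM_create(minDisparity=0, numDisparities=128, blockSize=5)
disp = matcher.compute(cv2.cvtColor(left, cv2.COLOR_BGR2GRAY),
                       cv2.cvtColor(right, cv2.COLOR_BGR2GRAY)).astype(np.float32) / 16.0
points_3d = cv2.reprojectImageTo3D(disp, Q)           # (H, W, 3)

# 2) 2D segmentation with Segment Anything, prompted by a single click on the object.
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")
predictor = SamPredictor(sam)
predictor.set_image(cv2.cvtColor(left, cv2.COLOR_BGR2RGB))
masks, scores, _ = predictor.predict(point_coords=np.array([[320, 240]]),
                                     point_labels=np.array([1]))
mask = masks[np.argmax(scores)]                        # best mask, shape (H, W)

# 3) Keep only valid, masked pixels: this is the scene/object cloud fed to the 3D VLM.
valid = mask & (disp > 0)
object_cloud = points_3d[valid]                        # (N, 3) array
```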
Our architecture extends the foundational 3D model, ShapeLLM-Omni, which is pretrained on large-scale 3D assets and capable of handling multi-modal inputs, including text, images, and 3D point clouds. We integrate Qwen-2.5 (7B), a state-of-the-art large vision-language model, with ShapeLLM-Omni to inherit both powerful linguistic grounding and native 3D spatial reasoning.
We extend the vocabulary by introducing dedicated position and scale tokens. Specifically, we define three independent position axes: X, Y, Z ∈ {1,2,...,256}, each discretized into 256 bins, and discretize object scale into S ∈ {1,2,...,128}, yielding 128 scale tokens. This introduces a total of 896 additional tokens (3 × 256 + 128) for spatial context.
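A minimal sketch of this vocabulary extension, assuming a HuggingFace-style tokenizer and the public Qwen-2.5 checkpoint as a stand-in for the ShapeLLM-Omni backbone; the token string format (e.g. "<X_17>") is our own illustrative choice, not the paper's.

```python
# Minimal sketch of the spatial vocabulary extension.
# The checkpoint name and token strings are placeholders for illustration only.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

# 3 position axes x 256 bins + 128 scale bins = 896 new tokens.
position_tokens = [f"<{axis}_{i}>" for axis in ("X", "Y", "Z") for i in range(1, 257)]
scale_tokens = [f"<S_{i}>" for i in range(1, 129)]
new_tokens = position_tokens + scale_tokens
assert len(new_tokens) == 896

tokenizer.add_tokens(new_tokens, special_tokens=True)
model.resize_token_embeddings(len(tokenizer))  # grow the embedding matrix for the new IDs
```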
Given a prompt and the current scene point cloud P_t, we generate a next point cloud prediction P̂_{t+1} such that P̂_{t+1} ≈ P_t + ΔP, where ΔP represents the learned spatial change conditioned on the prompt. We then compute the Iterative Closest Point (ICP) transformation that minimizes the alignment error between the current and predicted clouds.
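A minimal sketch of this alignment step, assuming Open3D's point-to-point ICP as a stand-in for the paper's geometric optimization; the correspondence threshold is an arbitrary placeholder.

```python
# Minimal sketch: recover the rigid transform between P_t and the predicted P̂_{t+1}.
# Both clouds are (N, 3) NumPy arrays; max_corr_dist is an illustrative value in meters.
import numpy as np
import open3d as o3d

def icp_transform(current_pts: np.ndarray, predicted_pts: np.ndarray,
                  max_corr_dist: float = 0.02) -> np.ndarray:
    """Return the 4x4 rigid transform that best aligns the current cloud to the prediction."""
    source = o3d.geometry.PointCloud(o3d.utility.Vector3dVector(current_pts))
    target = o3d.geometry.PointCloud(o3d.utility.Vector3dVector(predicted_pts))
    result = o3d.pipelines.registration.registration_icp(
        source, target, max_corr_dist, np.eye(4),
        o3d.pipelines.registration.TransformationEstimationPointToPoint())
    return result.transformation  # 4x4 matrix encoding the predicted spatial change
```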
The system processes both geometric and linguistic inputs by mapping them into a shared latent space Z. Let P denote the input point cloud and T the human-provided text prompt. We define modality-specific encoders such that: f_3D(P) ∈ Z, f_text(T) ∈ Z, where f_3D and f_text are encoders for the 3D point cloud and text, respectively. This joint embedding ensures that both geometry and language are represented in a unified feature space, enabling cross-modal reasoning.
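The small PyTorch sketch below illustrates the shared latent space Z with toy stand-in encoders; in Avi, f_3D and f_text are realized by ShapeLLM-Omni's 3D VQVAE tokens and Qwen-2.5's language embeddings, not by these MLPs, and the latent size and vocabulary size are placeholders.

```python
# Minimal sketch of the joint embedding: both modalities land in the same latent space Z.
import torch
import torch.nn as nn

LATENT_DIM = 1024  # illustrative size of Z

class PointCloudEncoder(nn.Module):        # f_3D: (B, N, 3) -> (B, LATENT_DIM)
    def __init__(self):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, 256), nn.ReLU(), nn.Linear(256, LATENT_DIM))
    def forward(self, pts):
        return self.mlp(pts).max(dim=1).values    # permutation-invariant max pooling

class TextEncoder(nn.Module):              # f_text: (B, T) token IDs -> (B, LATENT_DIM)
    def __init__(self, vocab_size=152064):        # placeholder vocabulary size
        super().__init__()
        self.embed = nn.Embedding(vocab_size, LATENT_DIM)
    def forward(self, token_ids):
        return self.embed(token_ids).mean(dim=1)  # mean-pooled token embeddings

f_3d, f_text = PointCloudEncoder(), TextEncoder()
z_geom = f_3d(torch.randn(1, 2048, 3))               # f_3D(P) ∈ Z
z_lang = f_text(torch.randint(0, 152064, (1, 12)))   # f_text(T) ∈ Z
```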
For the language backbone, Qwen-2.5 (7B) provides strong multi-turn instruction following, chain-of-thought reasoning, and multilingual capabilities, making it particularly well suited for language-conditioned robotics.
We fine-tune the foundational model on robotics training data using the LIBERO Dataset, which provides diverse task demonstrations within the Robosuite environment. Our experiments focus on the drawer-closing task, demonstrating that Avi produces semantically consistent and physically realizable manipulations from only a small number of demonstrations (50 demos).
Our experiments on the drawer-closing task show successful execution across eighteen inference steps. The model demonstrates the ability to generate semantically consistent and physically realizable action trajectories conditioned on natural language instructions like "Close the drawer with the robot."
Through qualitative ablation studies, we demonstrate that location quantization is critical for precise manipulation. In tasks requiring fine-grained control, such as pick-and-place or insertion, the model must accurately predict gripper positions for correct end-effector motions.
Unlike prior Vision-Language-Action (VLA) methods that directly predict robot-specific action tokens, our approach emphasizes a morphology-agnostic policy: rather than outputting actions, our model predicts transformed 3D point clouds from which robot-specific trajectories can be computed via inverse kinematics, as sketched after the comparison table below.
| Method | Input Modality | Core Mechanism | Generates 3D Point Clouds? | No Action Tokens Needed? |
|---|---|---|---|---|
| This Work (Avi) | 3D point clouds + language | 3D MLLM predicting delta point clouds + IK | ✓ Yes | ✓ Yes |
| Robot4DGen | RGB-D video | 4D video generation with multi-view constraint | ✗ No (video only) | ✗ No |
| Unified Video-Action (UVA) | RGB video | Joint video–action latent modeling | ✗ No | ✗ No |
| DP3 | 3D point clouds | Diffusion model over actions conditioned on 3D | ✗ No (conditions on 3D, outputs actions) | ✗ No |
| FP3 | 3D point clouds + language | Diffusion transformer policy pre-trained on 3D | ✗ No (outputs actions directly) | ✗ No |
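A minimal sketch of the geometry-to-action step referenced above, assuming the current end-effector pose is available as a 4×4 homogeneous matrix; the IK solver itself is robot-specific and not shown.

```python
# Minimal sketch: turn the ICP-recovered rigid transform into an end-effector goal pose
# that a robot-specific inverse-kinematics solver could consume (solver not shown).
import numpy as np
from scipy.spatial.transform import Rotation

def end_effector_goal(icp_T: np.ndarray, ee_pose: np.ndarray):
    """icp_T, ee_pose: 4x4 homogeneous matrices. Returns (position, quaternion xyzw)."""
    goal = icp_T @ ee_pose                      # compose the predicted spatial change with the pose
    position = goal[:3, 3]
    quaternion = Rotation.from_matrix(goal[:3, :3]).as_quat()
    return position, quaternion                 # hand off to any morphology-specific IK solver
```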
This shift from language-to-action to language-to-geometry enables richer spatial grounding, efficient sim-to-real transfer, and more robust reasoning in multi-object robotic environments.
In future work, we plan to extend Avi to multi-task and multi-robot settings, evaluate it under real-world deployment using stereo-based 3D reconstruction pipelines, and integrate reinforcement learning to further refine long-horizon planning capabilities.
@inproceedings{song2025avi,
title={Avi: A 3D Vision-Language Action Model Architecture generating Action from Volumetric Inference},
author={Harris Song and Long Le},
booktitle={NeurIPS 2025 Workshop on Embodied World Models for Decision Making},
year={2025},
url={https://openreview.net/forum?id=3UB24EwYWV}
}