GeoPredict: Leveraging Predictive Kinematics and 3D Gaussian Geometry for Precise VLA Manipulation

About

Vision-Language-Action (VLA) models achieve strong generalization in robotic manipulation but remain largely reactive and 2D-centric, making them unreliable in tasks that require precise 3D reasoning. We propose GeoPredict, a geometry-aware VLA framework that augments a continuous-action policy with predictive kinematic and geometric priors. GeoPredict introduces a trajectory-level module that encodes motion history and predicts multi-step 3D keypoint trajectories of robot arms, and a predictive 3D Gaussian geometry module that forecasts workspace geometry with track-guided refinement along future keypoint trajectories. These predictive modules serve exclusively as training-time supervision through depth-based rendering, while inference requires only lightweight additional query tokens without invoking any 3D decoding. Experiments on RoboCasa Human-50, LIBERO, and real-world manipulation tasks show that GeoPredict consistently outperforms strong VLA baselines, especially in geometry-intensive and spatially demanding scenarios.
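The idea of training-time-only supervision can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the linear heads, the dimensions, the `aux_weight` value, and all function names are hypothetical, and a plain mean-squared error on future 3D keypoints stands in for the depth-based rendering supervision described in the abstract. The only point it shows is that the auxiliary head contributes to the training loss and is never evaluated at inference.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 16          # token feature width (hypothetical)
ACTION_DIM = 7  # e.g. 6-DoF pose delta plus gripper (assumption)
K, T = 4, 5     # K keypoints tracked over T future steps (assumption)

# Hypothetical linear heads standing in for the real decoders.
W_action = rng.normal(scale=0.1, size=(D, ACTION_DIM))
W_keypoint = rng.normal(scale=0.1, size=(D, K * T * 3))


def predict_action(action_token):
    """Inference path: only the action head is evaluated."""
    return action_token @ W_action


def training_loss(action_token, query_token, target_action,
                  target_keypoints, aux_weight=0.5):
    """Training path: action loss plus an auxiliary future-keypoint loss."""
    action_loss = np.mean((predict_action(action_token) - target_action) ** 2)
    # The auxiliary head is used here only; inference never calls it.
    pred_keypoints = (query_token @ W_keypoint).reshape(T, K, 3)
    keypoint_loss = np.mean((pred_keypoints - target_keypoints) ** 2)
    total = action_loss + aux_weight * keypoint_loss
    return total, action_loss, keypoint_loss


# Random stand-ins for token features and zero-valued dummy targets.
action_token = rng.normal(size=D)
query_token = rng.normal(size=D)
target_action = np.zeros(ACTION_DIM)
target_keypoints = np.zeros((T, K, 3))

total, a_loss, k_loss = training_loss(
    action_token, query_token, target_action, target_keypoints
)
action = predict_action(action_token)
```

In this sketch the inference cost is a single matrix product on the action token, which mirrors the claim that the predictive modules add no 3D decoding at test time.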

Jingjing Qian, Boyao Han, Chen Shi, Lei Xiao, Long Yang, Shaoshuai Shi, Li Jiang • 2025

Related benchmarks

Task	Dataset	Metric	Result	
Robot Manipulation	LIBERO (test)	Average Success Rate	96.6	237
Robotic Manipulation	RoboCasa	Average Success Rate	52.4	68
Robot Manipulation	LIBERO	Spatial Success Rate	98	58
Robotic Manipulation	LIBERO Evaluation Suites	Average Success Rate	96.5	12
Robot Manipulation	RoboCasa Human-50	Pick & Place Success Rate	22.7	6
