
ConsisVLA-4D: Advancing Spatiotemporal Consistency in Efficient 3D-Perception and 4D-Reasoning for Robotic Manipulation

About

Current Vision-Language-Action (VLA) models primarily focus on mapping 2D observations to actions, but exhibit notable limitations in spatiotemporal perception and reasoning: 1) spatial representations often rely on additional sensors, introducing substantial computational overhead; 2) visual reasoning is typically limited to future-frame prediction, lacking alignment with the instruction-grounded scene and thus compromising spatiotemporal consistency. To address these challenges, we propose ConsisVLA-4D, a unified and efficient framework that enhances spatiotemporal consistency in 3D perception and 4D reasoning. Specifically, we design: 1) CV-Aligner, which ensures cross-view object semantic consistency by filtering instruction-relevant regions and aligning object identities across multiple viewpoints; 2) CO-Fuser, which guarantees cross-object spatial geometric consistency by eliminating spatial relation ambiguities between objects across views using compact latent representations. Building upon these, we introduce 3) CS-Thinker to achieve cross-scene spatiotemporal consistency as actions unfold. It learns implicit knowledge of local dynamics from object-semantic tokens of CV-Aligner and global depth from geometric tokens of CO-Fuser, thereby enhancing efficient visual reasoning under scene variations. Extensive experiments demonstrate that, benefiting from its efficient spatiotemporal consistency design, ConsisVLA-4D achieves 21.6% and 41.5% performance improvements, along with 2.3-fold and 2.4-fold inference speedups compared to OpenVLA on the LIBERO benchmark and real-world platforms, respectively. ConsisVLA-4D is open-sourced and publicly available at
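
As a rough illustration of what "filtering instruction-relevant regions" across views could look like, the sketch below keeps, for each camera view, the visual tokens most similar to an instruction embedding. This is not the paper's CV-Aligner: the cosine-similarity scoring, the top-k rule, the toy 2-D embeddings, and the names `cosine` and `filter_relevant_tokens` are all assumptions made for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def filter_relevant_tokens(view_tokens, instruction_vec, keep):
    """Hypothetical helper: indices of the `keep` tokens most similar to the
    instruction embedding, returned in their original order."""
    ranked = sorted(
        range(len(view_tokens)),
        key=lambda i: cosine(view_tokens[i], instruction_vec),
        reverse=True,
    )
    return sorted(ranked[:keep])


# Toy 2-D "embeddings" for two camera views (assumed data, for illustration only).
views = {
    "front": [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]],
    "wrist": [[0.0, 1.0], [0.8, 0.2], [0.1, 0.9]],
}
instruction = [1.0, 0.0]

kept = {
    name: filter_relevant_tokens(tokens, instruction, keep=2)
    for name, tokens in views.items()
}
print(kept)
```

Dropping low-relevance tokens before any cross-view alignment is one plausible way to reduce the number of tokens that later stages process; whether the released model does this in the same way is not stated in the abstract.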

Wei Li, Jizhihui Liu, Li Yixing, Junwen Tong, Rui Shao, Liqiang Nie • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Robotic Manipulation | LIBERO | Spatial Success Rate: 98.8 | 527 |
| Robotic Manipulation | ManiSkill2 | Picking Success Rate: 93 | 7 |
| Average Long-horizon Task Performance | Real-world Galaxea R1 Lite platform | Average Success Rate: 70 | 4 |
| Drawer Arrangement | Real-world Galaxea R1 Lite platform | Pull Score: 87 | 4 |
| Microwave Operation | Real-world Galaxea R1 Lite platform | Put Success Rate: 93 | 4 |
| Average Long-horizon Task Performance | Real-world AgileX Cobot Magic platform | Average Success Rate: 68.3 | 3 |
| Banana Peeling | Real-world Galaxea R1 Lite platform | Pick Score: 10 | 3 |
| Drawer Arrangement | Real-world AgileX Cobot Magic platform | Pull Score: 8.7 | 3 |
| Microwave Operation | Real-world AgileX Cobot Magic platform | Put Success Rate: 9.3 | 3 |
| T-shirt Folding | Real-world Galaxea R1 Lite platform | Step 1 Score: 9.3 | 3 |
