CronusVLA: Towards Efficient and Robust Manipulation via Multi-Frame Vision-Language-Action Modeling

About

Recent vision-language-action (VLA) models built on pretrained vision-language models (VLMs) have demonstrated strong performance in robotic manipulation. However, these models remain constrained by the single-frame image paradigm and fail to fully leverage the temporal information offered by multi-frame histories, as directly feeding multiple frames into VLM backbones incurs substantial computational overhead and inference latency. We propose CronusVLA, a unified framework that extends single-frame VLA models to the multi-frame paradigm. CronusVLA follows a two-stage process: (1) Single-frame pretraining on large-scale embodied datasets with autoregressive prediction of action tokens, establishing an effective embodied vision-language foundation; (2) Multi-frame post-training, which adapts the prediction of the vision-language backbone from discrete tokens to learnable features, and aggregates historical information via feature chunking. CronusVLA effectively addresses the existing challenges of multi-frame modeling while enhancing performance and observational robustness. To evaluate the robustness under temporal and spatial disturbances, we introduce SimplerEnv-OR, a novel benchmark featuring 24 types of observational disturbances and 120 severity levels. Experiments across three embodiments in simulated and real-world environments demonstrate that CronusVLA achieves leading performance and superior robustness, with a 70.9% success rate on SimplerEnv, a 26.8% improvement over OpenVLA on LIBERO, and the highest robustness score on SimplerEnv-OR. These results highlight the potential of efficient multi-frame adaptation in VLA models for more powerful and robust real-world deployment.
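The abstract's idea of aggregating history through cached features, rather than re-feeding every past frame to the backbone, can be illustrated with a minimal sketch. Everything named here is hypothetical and not taken from the CronusVLA code: `FeatureChunkBuffer`, `toy_encoder`, and `mean_pool` are illustrative stand-ins, the encoder is a placeholder for an expensive VLM forward pass, and mean pooling is only an assumed aggregator, since the abstract does not say how the chunked features are decoded into actions.

```python
from collections import deque
from typing import Callable, Deque, List, Sequence

Feature = List[float]


class FeatureChunkBuffer:
    """Hypothetical FIFO cache of per-frame features (not the authors' code)."""

    def __init__(self, encode_frame: Callable[[Sequence[float]], Feature], history: int) -> None:
        if history < 1:
            raise ValueError("history must be at least 1")
        self.encode_frame = encode_frame
        # deque(maxlen=...) silently drops the oldest feature once full.
        self.features: Deque[Feature] = deque(maxlen=history)
        self.encoder_calls = 0

    def push(self, frame: Sequence[float]) -> List[Feature]:
        """Encode only the newest frame and return the current feature chunk."""
        self.features.append(self.encode_frame(frame))
        self.encoder_calls += 1
        return list(self.features)


def toy_encoder(frame: Sequence[float]) -> Feature:
    """Placeholder for an expensive backbone forward pass (assumption)."""
    return [float(sum(frame)), float(len(frame))]


def mean_pool(chunk: List[Feature]) -> Feature:
    """Stand-in aggregator: element-wise mean over the cached features."""
    count = len(chunk)
    return [sum(column) / count for column in zip(*chunk)]


buffer = FeatureChunkBuffer(toy_encoder, history=3)
frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
for frame in frames:
    chunk = buffer.push(frame)

# Four frames cost four encoder calls in total; the chunk holds the last three features.
pooled = mean_pool(chunk)
```

Under these assumptions, each new observation triggers one encoder call regardless of the history length, which is the kind of saving the abstract attributes to avoiding multi-frame inputs to the VLM backbone.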

Hao Li, Shuai Yang, Yilun Chen, Xinyi Chen, Xiaoda Yang, Yang Tian, Hanqing Wang, Tai Wang, Dahua Lin, Feng Zhao, Jiangmiao Pang • 2025

Related benchmarks

Task	Dataset	Result
Robot Manipulation	LIBERO	Object Achievement 99.6	1025
Robot Manipulation	LIBERO	Spatial Success Rate 90.1	223
Robotic Manipulation	LIBERO	Long-horizon Success Rate 94	165
Robotic Manipulation	SimplerEnv	Success Rate: Spoon on Towel 66.7	60
Robotic Manipulation	SimplerEnv-Bridge WidowX robot (test)	Success Rate: Spoon on Towel 66.7	13
Temporal Robotic Manipulation	Mikasa-Robo	SGT 32	13
Robotic Manipulation	Mikasa-Robo	Intercept Medium 5	6
Inference Efficiency	Inference Efficiency	Throughput (Hz) 8.7	4
History usage and steerability analysis	n=200 source G3 G3b G4 environments	Decodability Score (A1) 14.2	3
