MetaVLA: Unified Meta Co-training For Efficient Embodied Adaption

About

Vision-Language-Action (VLA) models show promise in embodied reasoning, yet remain far from true generalists: they often require task-specific fine-tuning, incur high compute costs, and generalize poorly to unseen tasks. We propose MetaVLA, a unified, backbone-agnostic post-training framework for efficient and scalable alignment. MetaVLA introduces Context-Aware Meta Co-Training, which consolidates diverse target tasks into a single fine-tuning stage while leveraging structurally diverse auxiliary tasks to improve in-domain generalization. Unlike naive multi-task SFT, MetaVLA integrates a lightweight meta-learning mechanism, derived from Attentive Neural Processes, to enable rapid adaptation from diverse contexts with minimal architectural change or inference overhead. On the LIBERO benchmark, MetaVLA with six auxiliary tasks outperforms OpenVLA by up to 8.0% on long-horizon tasks, reduces training steps from 240K to 75K, and cuts GPU time by ~76%. These results show that scalable, low-resource post-training is achievable, paving the way toward general-purpose embodied agents. Code will be available.

Chen Li, Zhantao Yang, Han Zhang, Fangyi Chen, Chenchen Zhu, Anudeepsekhar Bolimera, Marios Savvides • 2025
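
The abstract says the meta-learning mechanism is derived from Attentive Neural Processes but gives no implementation details. As a rough illustration only, the sketch below shows the generic idea such models rely on: a target query attends over a set of context examples and receives a weighted summary of their values. It is not MetaVLA's code; the function names, the toy data, and the choice of scaled dot-product attention are all assumptions made for this example.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(query: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    """Summarize context values, weighted by query-key similarity.

    Hypothetical helper: scaled dot-product attention, used here as a
    stand-in for the context aggregation step of an attentive neural process.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


# Toy data (assumed): three context examples with 2-D keys and values.
context_keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
context_values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
target_query = [1.0, 0.0]

# The summary leans toward values whose keys resemble the query.
summary = cross_attend(target_query, context_keys, context_values)
```

In this toy setup the query matches the first and third keys best, so the summary is pulled toward their values. When all keys are identical, the weights become uniform and the result reduces to a plain average of the context values.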

Related benchmarks

Task	Dataset	Metric	Result
Robot Manipulation	LIBERO	Spatial Success Rate	88	223
Robot Manipulation	LIBERO Object	Success Rate	87	139
Robotic Manipulation	LIBERO Long	Success Rate	55	97
Robotic Manipulation	LIBERO Goal	Success Rate	77	55
Robotic Manipulation	LIBERO Average across suites	Success Rate (SR)	76	29
Robotic Manipulation	LIBERO Spatial	Success Rate (SR)	85	28
