
$M^2$-VLA: Boosting Vision-Language Models for Generalizable Manipulation via Layer Mixture and Meta-Skills

About

Current Vision-Language-Action (VLA) models predominantly rely on end-to-end fine-tuning. While effective, this paradigm compromises the inherent generalization capabilities of Vision-Language Models (VLMs) and incurs catastrophic forgetting. To address these limitations, we propose $M^2$-VLA, which demonstrates that a generalized VLM can serve directly as a powerful backbone for robotic manipulation. However, bridging the gap between the high-level semantic understanding of VLMs and the precise requirements of robotic control remains a key challenge. To overcome this, we introduce a Mixture of Layers (MoL) strategy that selectively extracts task-critical information from dense semantic features. In addition, to facilitate efficient trajectory learning under constrained model capacity, we propose a Meta Skill Module (MSM) that integrates strong inductive biases. Extensive experiments in both simulated and real-world environments demonstrate the effectiveness of our approach, and generalization and ablation studies validate the architecture's zero-shot capabilities and confirm the contribution of each key component. Our code and pre-trained models will be made publicly available.

Siyao Xiao, Yuhong Zhang, Zhifang Liu, Zihan Gao, Jingye Zhang, Sinwai Choo, Dake Zhong, Mengzhe Wang, Xiao Lin, Xianfeng Zhou, Jia Jia, Haoqian Wang • 2026

Related benchmarks

Task | Dataset | Result | Rank
Robot Manipulation | LIBERO | Spatial Success Rate: 97.8 | 46
Instruction Following | Real-world Robot Manipulation Generalization v1 (test) | Success Rate: 80 | 4
Novel objects manipulation | Real-world Robot Manipulation Generalization v1 (test) | Success Rate: 75 | 4
Pick apple into basket | Real-world Robot Manipulation Fundamental v1 (test) | Success Rate: 85 | 4
Pour water into bowl | Real-world Robot Manipulation Fundamental v1 (test) | Success Rate: 90 | 4
