$M^2$-VLA: Boosting Vision-Language Models for Generalizable Manipulation via Layer Mixture and Meta-Skills

About

Current Vision-Language-Action (VLA) models predominantly rely on end-to-end fine-tuning. While effective, this paradigm compromises the inherent generalization capabilities of Vision-Language Models (VLMs) and incurs catastrophic forgetting. To address these limitations, we propose $M^2$-VLA, which demonstrates that a generalized VLM can serve directly as a powerful backbone for robotic manipulation. A key challenge remains, however: bridging the gap between the high-level semantic understanding of VLMs and the precise requirements of robotic control. To overcome this, we introduce a Mixture of Layers (MoL) strategy that selectively extracts task-critical information from dense semantic features. To facilitate efficient trajectory learning under constrained model capacity, we further propose a Meta Skill Module (MSM) that integrates strong inductive biases. Extensive experiments in both simulated and real-world environments demonstrate the effectiveness of our approach, and generalization and ablation studies validate the architecture's zero-shot capabilities and confirm the contribution of each key component. Our code and pre-trained models will be made publicly available.
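
The abstract does not say how MoL combines information across VLM layers, so the following is only a generic illustration of the idea of mixing layers, not the authors' method. It assumes each layer yields one feature vector and that a gate supplies one logit per layer. The names `mix_layers`, `layer_features`, and `gate_logits` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mix_layers(layer_features, gate_logits):
    """Combine per-layer feature vectors with softmax gate weights.

    Hypothetical sketch: layer_features holds one vector per layer
    (all of equal length), and gate_logits holds one score per layer.
    Returns the weighted sum of the vectors and the weights used.
    """
    if len(layer_features) != len(gate_logits):
        raise ValueError("need exactly one gate logit per layer")
    weights = softmax(gate_logits)
    mixed = [0.0] * len(layer_features[0])
    for weight, features in zip(weights, layer_features):
        for i, value in enumerate(features):
            mixed[i] += weight * value
    return mixed, weights


# Two toy "layers" with equal gate scores reduce to a plain average.
mixed, weights = mix_layers([[1.0, 2.0], [3.0, 4.0]], [0.0, 0.0])
```

With equal logits the gate weights are uniform, so the mixed vector is the element-wise mean of the layers. A strongly peaked gate instead passes through a single layer almost unchanged, which is one way a model could select features from a particular depth.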

Siyao Xiao, Yuhong Zhang, Zhifang Liu, Zihan Gao, Jingye Zhang, Sinwai Choo, Dake Zhong, Mengzhe Wang, Xiao Lin, Xianfeng Zhou, Jia Jia, Haoqian Wang • 2026

Related benchmarks

Task	Dataset	Result
Robot Manipulation	LIBERO	Spatial Success Rate: 97.8	46
Instruction Following	Real-world Robot Manipulation Generalization v1 (test)	Success Rate: 80	4
Novel objects manipulation	Real-world Robot Manipulation Generalization v1 (test)	Success Rate: 75	4
Pick apple into basket	Real-world Robot Manipulation Fundamental v1 (test)	Success Rate: 85	4
Pour water into bowl	Real-world Robot Manipulation Fundamental v1 (test)	Success Rate: 90	4
