TwinBrainVLA: Unleashing the Potential of Generalist VLMs for Embodied Tasks via Asymmetric Mixture-of-Transformers

About

The fundamental premise of Vision-Language-Action (VLA) models is to harness the extensive general capabilities of pre-trained Vision-Language Models (VLMs) for generalized embodied intelligence. However, standard robotic fine-tuning inevitably disrupts the pre-trained feature space, leading to "catastrophic forgetting" that compromises the general visual understanding we aim to leverage. To effectively utilize the uncorrupted general capabilities of VLMs for robotic tasks, we propose TwinBrainVLA, which coordinates two isomorphic VLM pathways: a frozen generalist (also called "Left Brain") and a trainable specialist (also called "Right Brain"). Our architecture utilizes an Asymmetric Mixture-of-Transformers (AsyMoT) mechanism, enabling the Right Brain to dynamically query and fuse intact semantic knowledge from the Left Brain with proprioceptive states. This fused representation conditions a flow-matching action expert for precise continuous control. Empirical results on SimplerEnv and RoboCasa benchmarks demonstrate that by explicitly retaining general capabilities, TwinBrainVLA achieves substantial performance gains over baseline models in complex manipulation tasks.
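The abstract does not spell out how AsyMoT is implemented, so the following is only a rough sketch of one way an asymmetric two-pathway attention layer could look, written in NumPy. It assumes (this is not confirmed by the source) that the frozen pathway attends only to its own tokens, while the trainable pathway forms queries from its own tokens and attends over the keys and values of both pathways. The names `asymmetric_layer`, `make_weights`, `left_h`, and `right_h` are hypothetical, and the random weights stand in for real VLM parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Single-head scaled dot-product attention.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def asymmetric_layer(left_h, right_h, left_w, right_w):
    """Hypothetical asymmetric layer (an assumption, not the paper's code).

    left_h:  (n_left, d) hidden states of the frozen pathway.
    right_h: (n_right, d) hidden states of the trainable pathway.
    """
    lq, lk, lv = (left_h @ left_w[n] for n in ("q", "k", "v"))
    rq, rk, rv = (right_h @ right_w[n] for n in ("q", "k", "v"))

    # Frozen pathway: self-attention only, so it never sees the right pathway.
    left_out = left_h + attention(lq, lk, lv)

    # Trainable pathway: queries attend over keys/values of both pathways.
    k_joint = np.concatenate([lk, rk], axis=0)
    v_joint = np.concatenate([lv, rv], axis=0)
    right_out = right_h + attention(rq, k_joint, v_joint)
    return left_out, right_out

rng = np.random.default_rng(0)
d = 8

def make_weights():
    # Random stand-ins for the q/k/v projections of one pathway.
    return {n: rng.normal(scale=0.1, size=(d, d)) for n in "qkv"}

left_w, right_w = make_weights(), make_weights()
left_h = rng.normal(size=(5, d))    # e.g. vision-language tokens
right_h = rng.normal(size=(7, d))   # e.g. the same tokens plus state tokens (assumed)

left_out, right_out = asymmetric_layer(left_h, right_h, left_w, right_w)
print(left_out.shape, right_out.shape)
```

In this sketch information flows one way: perturbing `right_h` leaves `left_out` unchanged, whereas perturbing `left_h` changes `right_out`. In a real training setup the left pathway's weights would additionally be excluded from gradient updates.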

Bin Yu, Shijie Lian, Xiaopeng Lin, Yuliang Wei, Zhaolong Shen, Changti Wu, Yuzhuo Miao, Xinming Wang, Bailing Wang, Cong Huang, Kai Chen • 2026

Related benchmarks

Task	Dataset	Metric	Value
Robot Manipulation	SimplerEnv OOD	Put Spoon on Towel Success Rate	87.5	19
Robot Manipulation	LIBERO (All four suites (combined))	Spatial Success Rate	99.2	12
Robot Manipulation	RoboCasa Tabletop official	Avg Success Rate	0.546	8
Pick-&-Place	Franka Research 3 Out-of-Domain zero-shot	Success Rate	5.00e+3	5
Pick-&-Place	Franka Research 3 Pick-All (long-horizon)	Success Rate	1.00e+3	5
Pick-&-Place	Franka Research 3 In-Domain	Success Rate	93.3	5

