HiMoE-VLA: Hierarchical Mixture-of-Experts for Generalist Vision-Language-Action Policies

About

The development of foundation models for embodied intelligence critically depends on access to large-scale, high-quality robot demonstration data. Recent approaches have sought to address this challenge by training on large collections of heterogeneous robotic datasets. However, unlike vision or language data, robotic demonstrations exhibit substantial heterogeneity across embodiments and action spaces, as well as other prominent variations such as sensor configurations and action control frequencies. Because existing methods lack explicit designs for handling such heterogeneity, they struggle to integrate these diverse factors, which limits their generalization and degrades performance when they are transferred to new settings. In this paper, we present HiMoE-VLA, a novel vision-language-action (VLA) framework tailored to handle heterogeneous robotic data effectively. Specifically, we introduce a Hierarchical Mixture-of-Experts (HiMoE) architecture for the action module that adaptively handles multiple sources of heterogeneity across layers and gradually abstracts them into shared knowledge representations. In extensive experiments on simulation benchmarks and real-world robotic platforms, HiMoE-VLA consistently outperforms existing VLA baselines, achieving higher accuracy and robust generalization across diverse robots and action spaces. The code and models are publicly available at https://github.com/ZhiyingDu/HiMoE-VLA.
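To make the mixture-of-experts idea in the abstract concrete, the following is a minimal, generic sketch of top-k gated expert routing stacked over several layers. It is not the HiMoE architecture from the paper: the class `MoELayer`, the helpers `build_stack` and `run_stack`, the linear experts, the residual connection, and the choice of fewer experts in later layers are all hypothetical, chosen only to illustrate how a router can send an input to a subset of experts and how later layers might share more parameters across inputs.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


class MoELayer:
    """Generic top-k gated mixture of linear experts (illustrative only)."""

    def __init__(self, dim, num_experts, top_k, rng):
        self.top_k = top_k
        # Router: one score row per expert (random weights stand in for learned ones).
        self.gate = [[rng.uniform(-1, 1) for _ in range(dim)]
                     for _ in range(num_experts)]
        # Each expert is a dim x dim linear map (an assumption for brevity).
        self.experts = [[[rng.uniform(-1, 1) for _ in range(dim)]
                         for _ in range(dim)]
                        for _ in range(num_experts)]

    def route(self, x):
        """Return {expert_index: weight} for the top-k experts, weights summing to 1."""
        probs = softmax(matvec(self.gate, x))
        ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
        chosen = ranked[:self.top_k]
        norm = sum(probs[i] for i in chosen)
        return {i: probs[i] / norm for i in chosen}

    def forward(self, x):
        """Weighted sum of the selected experts' outputs."""
        out = [0.0] * len(x)
        for i, w in self.route(x).items():
            y = matvec(self.experts[i], x)
            out = [o + w * v for o, v in zip(out, y)]
        return out


def build_stack(dim, experts_per_layer, top_k, seed=0):
    """Hypothetical stack: one MoE layer per entry in experts_per_layer."""
    rng = random.Random(seed)
    return [MoELayer(dim, n, top_k, rng) for n in experts_per_layer]


def run_stack(layers, x):
    """Apply each layer with a residual connection (an assumed design choice)."""
    for layer in layers:
        x = [a + b for a, b in zip(x, layer.forward(x))]
    return x
```

For example, `run_stack(build_stack(4, [4, 2, 1], top_k=1), [0.1, 0.2, 0.3, 0.4])` passes a 4-dimensional input through three layers whose expert counts shrink from four to one, so the final layer is fully shared. How the actual model organizes its experts is described in the paper and the linked repository, not here.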

Zhiying Du, Bei Liu, Yaobo Liang, Yichao Shen, Haidong Cao, Xiangyu Zheng, Zhiyuan Feng, Zuxuan Wu, Jiaolong Yang, Yu-Gang Jiang • 2025

Related benchmarks

Task	Dataset	Result
Robotic Manipulation	LIBERO (test)	Object Success Rate: 99.4	58
Robotic Manipulation	CALVIN D->D	--	40
Robotic Manipulation	Aloha-AgileX Real-World Basic Tasks (evaluation)	Average Success Rate: 63.7	7
Aggregate manipulation performance (All tasks)	XArm7 Real-world	Overall Avg Success Rate: 75	5
Block-on-Block manipulation	XArm7 Real-world	Pick Success Rate: 83.3	5
Cup-in-Cup manipulation	XArm7 Real-world	Pick Success Rate: 88.9	5
Fruit-to-Plate manipulation	XArm7 Real-world	Pick Success Rate: 81.3	5
Fold-Shorts	Aloha Real-world (evaluation)	Single Fold Success Rate: 80	4
Robot Manipulation	XArm7 Single-Arm Real-World Generalization (test)	Success Rate (Distractor): 69.4	4
Scoop	Aloha Real-world (evaluation)	Place Success Rate: 100	4

Showing 10 of 12 rows
