DualVLA: Building a Generalizable Embodied Agent via Partial Decoupling of Reasoning and Action

About

To build a generalizable Vision-Language-Action (VLA) model with strong reasoning ability, a common strategy is to first train a specialist VLA on robot demonstrations to acquire reliable manipulation skills, and then incorporate mixed annotated robot data together with multimodal data to restore broader reasoning capabilities. However, we observe that the resulting reasoning VLA often suffers from degraded action performance compared to the specialist model before fine-tuning, a phenomenon we refer to as action degeneration. To address this issue, we propose DualVLA, which enhances action performance through carefully designed post-training while still preserving reasoning capability. We first introduce a dual-layer data pruning method that removes redundant embodied reasoning, preventing it from adversely influencing action learning. To further strengthen action generation, we design a dual-teacher adaptive distillation strategy that assigns different supervision signals to different data domains while maintaining reasoning ability. To fill the evaluation gap for generalist VLAs, we also propose VLA Score, which decouples VLA capability into reasoning, intention, action, and alignment dimensions for a more fine-grained assessment. Experiments show that DualVLA achieves an average success rate of 61.0 in SimplerEnv and an average score of 65.4 across eight competitive multimodal benchmarks, demonstrating a stronger balance between precise action execution and multimodal understanding. Project Website: https://costaliya.github.io/DualVLA/.

Zhen Fang, Zhuoyang Liu, Jiaming Liu, Hao Chen, Yu Zeng, Shiting Huang, Zehui Chen, Lin Chen, Shanghang Zhang, Feng Zhao • 2025

Related benchmarks

Task	Dataset	Result
Drawer Opening	SimplerEnv Google Robot embodiment (test)	Success Rate 64	28
Pick Can	SimplerEnv Google Robot embodiment	Success Rate 93.3	28
Move Near	SimplerEnv Google Robot embodiment	Success Rate 75.3	28
General Robot Manipulation	SimplerEnv	Average Success Rate 61	23
Put Carrot	SimplerEnv WidowX Robot embodiment	Success Rate 50	13
Put Spoon	SimplerEnv WidowX Robot embodiment	Success Rate 5.00e+3	13
Stack Blocks	SimplerEnv WidowX Robot embodiment	Success Rate 8.3	13
Vision-Language-Action	VLA Evaluation Suite	A Score 0.648	10
Robotic Manipulation	SimplerEnv	--	5
Handover Objects	Self-collected Real-world Data Galaxea R1-lite	Success Rate (O1) 80	2

Showing 10 of 12 rows
