Discrete Diffusion VLA: Bringing Discrete Diffusion to Action Decoding in Vision-Language-Action Policies

About

Vision-Language-Action (VLA) models adapt large vision-language backbones to map images and instructions into robot actions. However, prevailing VLAs either generate actions autoregressively in a fixed left-to-right order, with poor performance, or attach separate diffusion heads outside the backbone, which fragments information pathways and hinders unified, scalable architectures. Instead, we present Discrete Diffusion VLA, which discretizes action chunks and models them with discrete diffusion, retaining progressive refinement inside the unified transformer backbone. Our method achieves an adaptive decoding order that resolves high-confidence action elements before harder ones and employs secondary re-masking to revisit uncertain predictions, enabling robust error correction. This design preserves pretrained vision-language priors, supports parallel decoding, and improves efficiency. Discrete Diffusion VLA achieves 96.4% average success on LIBERO, 71.2% visual matching on SimplerEnv-Fractal, and 54.2% overall on SimplerEnv-Bridge. On out-of-distribution tests of LIBERO-Goal, our method exhibits only 0.8% language degradation versus 8.0% for parallel decoding, and 20.4% vision degradation versus 29.0% for continuous diffusion, demonstrating strong retention of pretrained vision-language capabilities. We also conduct two real-robot evaluations on the AgileX Cobot Magic platform to show the method's effectiveness.

Zhixuan Liang, Yizhuo Li, Tianshuo Yang, Chengyue Wu, Sitong Mao, Liuao Pei, Tian Nian, Shunbo Zhou, Xiaokang Yang, Jiangmiao Pang, Yao Mu, Ping Luo • 2025
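
The decoding scheme described in the abstract (commit high-confidence action tokens first, then re-mask and revisit uncertain ones) can be illustrated with a minimal sketch. This is not the authors' implementation: `toy_model`, the cosine unmasking schedule, the `remask_threshold` rule, and all parameter values are assumptions chosen only to make the idea concrete.

```python
import math
import random

MASK = -1  # sentinel for an action token that has not been decoded yet


def toy_model(tokens, vocab_size, rng):
    """Hypothetical stand-in for the policy: one probability distribution
    per position. A real model would condition on the images, the
    instruction, and the tokens that are already unmasked."""
    dists = []
    for _ in tokens:
        logits = [rng.gauss(0.0, 2.0) for _ in range(vocab_size)]
        top = max(logits)
        exps = [math.exp(x - top) for x in logits]
        total = sum(exps)
        dists.append([e / total for e in exps])
    return dists


def decode(model, length, vocab_size, steps=4, remask_threshold=0.2, seed=0):
    """Confidence-ordered parallel decoding with re-masking (assumed rules)."""
    rng = random.Random(seed)
    tokens = [MASK] * length
    for step in range(1, steps + 1):
        dists = model(tokens, vocab_size, rng)

        # Secondary re-masking (assumed rule): a committed token whose
        # probability under the current prediction is low is masked again.
        for i, tok in enumerate(tokens):
            if tok != MASK and dists[i][tok] < remask_threshold:
                tokens[i] = MASK

        masked = [i for i, tok in enumerate(tokens) if tok == MASK]

        # Assumed cosine schedule: how many positions stay masked after
        # this step. The final step leaves none masked.
        if step == steps:
            keep_masked = 0
        else:
            keep_masked = int(length * math.cos(math.pi / 2 * step / steps))
        n_commit = max(len(masked) - keep_masked, 0)

        # Adaptive order: commit the most confident masked positions first.
        ranked = sorted(masked, key=lambda i: max(dists[i]), reverse=True)
        for i in ranked[:n_commit]:
            tokens[i] = max(range(vocab_size), key=lambda v: dists[i][v])
    return tokens
```

Calling `decode(toy_model, length=7, vocab_size=16)` returns a list of seven token ids with no masked positions left. Because the final step commits every remaining position, the loop always finishes in exactly `steps` model calls regardless of how many tokens were re-masked along the way.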

Related benchmarks

Task	Dataset	Metric	Result
Robot Manipulation	LIBERO	Object Achievement	98.6	957
Robotic Manipulation	LIBERO	Spatial Success Rate	97.2	527
Robot Manipulation	LIBERO (test)	Average Success Rate	96.3	220
Robot Manipulation	LIBERO Object	Success Rate	96.6	127
Robot Manipulation	LIBERO	Spatial Success Rate	97.2	116
Robot Manipulation	SimplerEnv WidowX	Success Rate: Put Spoon on Towel	37.5	98
Robot Manipulation	SimplerEnv Google Robot tasks Variant Aggregation	Average Success Rate	39.8	88
Robotic Manipulation	LIBERO v1 (test)	Average Success Rate	96.3	83
Robot Manipulation	SimplerEnv WidowX Robot tasks (test)	Success Rate (Spoon)	29.2	79
Robot Manipulation	SimplerEnv Google Robot tasks Visual Matching	Pick Coke Can Success Rate	16.3	62

Showing 10 of 34 rows
