
Discrete Diffusion VLA: Bringing Discrete Diffusion to Action Decoding in Vision-Language-Action Policies

About

Vision-Language-Action (VLA) models adapt large vision-language backbones to map images and instructions into robot actions. However, prevailing VLAs either generate actions auto-regressively in a fixed left-to-right order or attach separate MLP or diffusion heads outside the backbone, leading to fragmented information pathways and specialized training requirements that hinder a unified, scalable architecture. We present Discrete Diffusion VLA, a unified-transformer policy that models discretized action chunks with discrete diffusion. The design retains diffusion's progressive refinement paradigm while remaining natively compatible with the discrete token interface of VLMs. Our method achieves an adaptive decoding order that resolves easy action elements before harder ones, and uses secondary re-masking to revisit uncertain predictions across refinement rounds, which improves consistency and enables robust error correction. This unified decoder preserves pre-trained vision-language priors, supports parallel decoding, breaks the autoregressive bottleneck, and reduces the number of function evaluations. Discrete Diffusion VLA achieves a 96.3% average success rate on LIBERO, 71.2% visual matching on SimplerEnv-Fractal, and 54.2% overall on SimplerEnv-Bridge. We also provide an ablation study on vision-language ability retention on the LIBERO-OOD (out-of-distribution) benchmark, where our method improves over autoregressive, MLP-decoder, and continuous-diffusion baselines. These findings indicate that discrete-diffusion VLA supports precise action modeling and consistent training, laying groundwork for scaling VLA to larger models and datasets. Our code is available at https://github.com/Liang-ZX/DiscreteDiffusionVLA/tree/libero.
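The decoding procedure described above (parallel mask-predict with an easy-first commit order and secondary re-masking) can be sketched as follows. This is a minimal toy, not the paper's implementation: `toy` confidences stand in for the model's per-token scores, the proposal is an oracle for the target chunk, and names like `decode_chunk` and `remask_thresh` are illustrative assumptions.

```python
import numpy as np

MASK = -1  # sentinel id for a masked action token (assumed convention)


def decode_chunk(target, rounds=4, remask_thresh=0.5, seed=0):
    """Toy sketch of discrete-diffusion action decoding.

    Each round: (1) score every position with a confidence (here random;
    a real policy would use the unified transformer's token logits),
    (2) re-mask previously committed tokens whose confidence fell below
    a threshold (secondary re-masking), (3) commit the highest-confidence
    masked positions first (adaptive, easy-before-hard order).
    """
    rng = np.random.default_rng(seed)
    n = len(target)
    tokens = np.full(n, MASK)
    per_round = int(np.ceil(n / rounds))  # linear unmasking schedule

    for _ in range(rounds):
        conf = rng.uniform(0.3, 1.0, size=n)   # stand-in confidences
        proposal = np.asarray(target).copy()   # toy oracle "prediction"

        # Secondary re-masking: revisit uncertain committed tokens.
        committed = tokens != MASK
        tokens[committed & (conf < remask_thresh)] = MASK

        # Commit the most confident masked positions in parallel.
        masked = np.flatnonzero(tokens == MASK)
        order = masked[np.argsort(-conf[masked])]
        for i in order[:per_round]:
            tokens[i] = proposal[i]

    # Fill any positions still masked after the final refinement round.
    still_masked = tokens == MASK
    tokens[still_masked] = proposal[still_masked]
    return tokens
```

Because all masked positions are scored in one forward pass per round, the number of function evaluations is the round count (here 4), not the chunk length, which is the source of the claimed speedup over left-to-right autoregressive decoding.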

Zhixuan Liang, Yizhuo Li, Tianshuo Yang, Chengyue Wu, Sitong Mao, Tian Nian, Liuao Pei, Shunbo Zhou, Xiaokang Yang, Jiangmiao Pang, Yao Mu, Ping Luo • 2025

Related benchmarks

| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Robot Manipulation | LIBERO | Goal Achievement | 97.4 | 700 |
| Robotic Manipulation | LIBERO | Spatial Success Rate | 97.2 | 314 |
| Robot Manipulation | LIBERO (test) | Average Success Rate | 96.3 | 184 |
| Robot Manipulation | SimplerEnv WidowX Robot tasks (test) | Success Rate (Spoon) | 29.2 | 79 |
| Robot Manipulation | SimplerEnv Google Robot tasks Variant Aggregation | Average Success Rate | 39.8 | 67 |
| Robot Manipulation | SimplerEnv Google Robot tasks Visual Matching | Pick Coke Can Success Rate | 16.3 | 62 |
| Robotic Manipulation | SIMPLER Google Robot VA | Pick Up Coke Can Success Rate | 82.5 | 35 |
| Robot Manipulation | SimplerEnv WidowX Robot tasks | Average Success Rate | 1 | 32 |
| Robotic Manipulation | SIMPLER Google Robot Visual Matching | PickCan Success Rate | 85.4 | 24 |
| Robot Manipulation | Simpler-Bridge v1 (test) | Success Rate (Spoon) | 29.2 | 21 |

Showing 10 of 15 rows.
