
Discrete Diffusion VLA: Bringing Discrete Diffusion to Action Decoding in Vision-Language-Action Policies

About

Vision-Language-Action (VLA) models adapt large vision-language backbones to map images and instructions into robot actions. However, prevailing VLAs either generate actions auto-regressively in a fixed left-to-right order or attach separate MLP or diffusion heads outside the backbone, leading to fragmented information pathways and specialized training requirements that hinder a unified, scalable architecture. We present Discrete Diffusion VLA, a unified-transformer policy that models discretized action chunks with discrete diffusion. The design retains diffusion's progressive-refinement paradigm while remaining natively compatible with the discrete token interface of VLMs. Our method achieves an adaptive decoding order that resolves easy action elements before harder ones, and uses secondary re-masking to revisit uncertain predictions across refinement rounds, which improves consistency and enables robust error correction. This unified decoder preserves pre-trained vision-language priors, supports parallel decoding, breaks the autoregressive bottleneck, and reduces the number of function evaluations. Discrete Diffusion VLA achieves a 96.3% average success rate on LIBERO, 71.2% visual matching on SimplerEnv-Fractal, and 54.2% overall on SimplerEnv-Bridge. We also provide an ablation study of vision-language ability retention on the LIBERO-OOD (Out-of-Distribution) benchmark, where our method improves over autoregressive, MLP-decoder, and continuous-diffusion baselines. These findings indicate that discrete-diffusion action decoding supports precise action modeling and consistent training, laying the groundwork for scaling VLAs to larger models and datasets. Our code is available at https://github.com/Liang-ZX/DiscreteDiffusionVLA/tree/libero.
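The decoding loop described above (parallel masked prediction, committing easy tokens first, and secondary re-masking of low-confidence commitments) can be sketched in a toy form. This is a minimal illustration of the general masked-diffusion decoding pattern, not the paper's implementation: `toy_policy`, the chunk length, vocabulary size, round count, and the re-masking threshold are all hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

MASK = -1                 # sentinel for a masked action token
VOCAB = 256               # assumed size of the discretized action vocabulary
CHUNK = 8                 # assumed action-chunk length
ROUNDS = 4                # refinement rounds; ROUNDS < CHUNK gives parallel decoding
REMASK_TAU = 1.5 / VOCAB  # toy confidence threshold for secondary re-masking

def toy_policy(tokens):
    """Stand-in for the unified VLM backbone: per-position distributions
    over action tokens. A real model conditions on image + instruction."""
    probs = rng.random((CHUNK, VOCAB))
    return probs / probs.sum(axis=1, keepdims=True)

def decode_chunk():
    tokens = np.full(CHUNK, MASK)
    conf = np.zeros(CHUNK)
    for r in range(ROUNDS):
        # Secondary re-masking: revisit committed tokens that look uncertain.
        committed = tokens != MASK
        tokens[committed & (conf < REMASK_TAU)] = MASK

        masked = tokens == MASK
        if not masked.any():
            break
        probs = toy_policy(tokens)
        pred = probs.argmax(axis=1)
        pred_conf = probs.max(axis=1)

        # Adaptive order: commit the easiest (highest-confidence) masked
        # slots first, on a schedule that unmasks everything by the last round.
        k = int(np.ceil(masked.sum() * (r + 1) / ROUNDS))
        order = np.argsort(np.where(masked, -pred_conf, np.inf))
        commit = order[:k]
        tokens[commit] = pred[commit]
        conf[commit] = pred_conf[commit]

    # Safety net: greedily fill any slot still masked.
    masked = tokens == MASK
    if masked.any():
        tokens[masked] = toy_policy(tokens).argmax(axis=1)[masked]
    return tokens
```

Because all `CHUNK` positions are predicted in parallel each round, the number of function evaluations is bounded by `ROUNDS` rather than the chunk length, which is the source of the speedup over left-to-right autoregressive decoding.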

Zhixuan Liang, Yizhuo Li, Tianshuo Yang, Chengyue Wu, Sitong Mao, Tian Nian, Liuao Pei, Shunbo Zhou, Xiaokang Yang, Jiangmiao Pang, Yao Mu, Ping Luo • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Robot Manipulation | LIBERO | Goal Achievement | 97.4 | 494 |
| Robot Manipulation | SimplerEnv WidowX Robot tasks (test) | Success Rate (Spoon) | 29.2 | 79 |
| Robot Manipulation | SimplerEnv Google Robot tasks Visual Matching | Pick Coke Can Success Rate | 16.3 | 62 |
| Robot Manipulation | SimplerEnv Google Robot tasks Variant Aggregation | Pick Coke Can Success Rate | 54.5 | 44 |
| Robot Manipulation | SimplerEnv WidowX Robot tasks | Average Success Rate | 1 | 26 |
| Robotic Manipulation | SIMPLER Google Robot Visual Matching | PickCan Success Rate | 85.4 | 24 |
| Robotic Manipulation | SIMPLER Google Robot VA | Pick Up Coke Can Success Rate | 82.5 | 20 |
| Robotic Manipulation | WidowX | Spoon Success Rate | 29.2 | 17 |
| Robotic Manipulation | Google Robot Variant Aggregation | Pick Success Rate | 82.5 | 15 |
| Robotic Manipulation | SimplerEnv Google Robot tasks (test) | Visual Matching (Pick Coke) | 85.4 | 14 |
