Discrete Diffusion VLA: Bringing Discrete Diffusion to Action Decoding in Vision-Language-Action Policies
About
Vision-Language-Action (VLA) models adapt large vision-language backbones to map images and instructions into robot actions. However, prevailing VLAs either generate actions autoregressively in a fixed left-to-right order, which limits performance, or attach separate diffusion heads outside the backbone, which fragments information pathways and hinders unified, scalable architectures. Instead, we present Discrete Diffusion VLA, which discretizes action chunks and models them with discrete diffusion, retaining progressive refinement inside a unified transformer backbone. Our method uses an adaptive decoding order that resolves high-confidence action elements before harder ones, and employs secondary re-masking to revisit uncertain predictions, enabling robust error correction. This design preserves pretrained vision-language priors, supports parallel decoding, and improves efficiency. Discrete Diffusion VLA achieves 96.4% average success on LIBERO, 71.2% visual matching on SimplerEnv-Fractal, and 54.2% overall on SimplerEnv-Bridge. On out-of-distribution tests of LIBERO-Goal, our method exhibits only 0.8% language degradation versus 8.0% for parallel decoding, and 20.4% vision degradation versus 29.0% for continuous diffusion, demonstrating strong retention of pretrained vision-language capabilities. We also conduct two real-robot evaluations on the AgileX Cobot Magic platform to show the method's effectiveness.
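The decoding idea described above (predict all action tokens in parallel, commit the most confident ones first, and re-mask committed tokens that later look doubtful) can be illustrated with a toy sketch. This is not the authors' implementation: `toy_model` is a random stand-in for the transformer backbone, and the linear commit schedule, the `remask_below` threshold rule, and all names here are assumptions made purely for illustration.

```python
import random

MASK = -1  # hypothetical sentinel for a still-masked action token


def toy_model(tokens, vocab_size, rng):
    """Stand-in for the backbone: one random probability vector per position."""
    dists = []
    for _ in tokens:
        weights = [rng.random() for _ in range(vocab_size)]
        total = sum(weights)
        dists.append([w / total for w in weights])
    return dists


def decode_chunk(length, vocab_size, steps, remask_below, seed=0):
    """Iteratively unmask an action chunk, most confident positions first."""
    rng = random.Random(seed)
    tokens = [MASK] * length
    for step in range(1, steps + 1):
        # All positions are predicted in parallel at every refinement step.
        dists = toy_model(tokens, vocab_size, rng)
        last = step == steps

        # Secondary re-masking (assumed rule): drop committed tokens whose
        # current probability under the model falls below a threshold.
        if not last:
            for i, tok in enumerate(tokens):
                if tok != MASK and dists[i][tok] < remask_below:
                    tokens[i] = MASK

        # Adaptive order: commit the most confident masked positions first,
        # following an assumed linear schedule for how many to keep.
        masked = [i for i, tok in enumerate(tokens) if tok == MASK]
        target_committed = length if last else (length * step) // steps
        n_commit = max(0, target_committed - (length - len(masked)))
        masked.sort(key=lambda i: max(dists[i]), reverse=True)
        for i in masked[:n_commit]:
            tokens[i] = max(range(vocab_size), key=dists[i].__getitem__)
    return tokens


chunk = decode_chunk(length=8, vocab_size=16, steps=4, remask_below=0.05)
print(chunk)
```

Because the final step commits every remaining masked position, the returned chunk never contains `MASK`. In a real policy the random `toy_model` would be replaced by the backbone's per-position logits conditioned on images and the instruction.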
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Robot Manipulation | LIBERO | Object Achievement | 98.6 | 957 |
| Robotic Manipulation | LIBERO | Spatial Success Rate | 97.2 | 527 |
| Robot Manipulation | LIBERO (test) | Average Success Rate | 96.3 | 220 |
| Robot Manipulation | LIBERO Object | Success Rate | 96.6 | 127 |
| Robot Manipulation | LIBERO | Spatial Success Rate | 97.2 | 116 |
| Robot Manipulation | SimplerEnv WidowX | Success Rate: Put Spoon on Towel | 37.5 | 98 |
| Robot Manipulation | SimplerEnv Google Robot tasks Variant Aggregation | Average Success Rate | 39.8 | 88 |
| Robotic Manipulation | LIBERO v1 (test) | Average Success Rate | 96.3 | 83 |
| Robot Manipulation | SimplerEnv WidowX Robot tasks (test) | Success Rate (Spoon) | 29.2 | 79 |
| Robot Manipulation | SimplerEnv Google Robot tasks Visual Matching | Pick Coke Can Success Rate | 16.3 | 62 |