Action Draft and Verify: A Self-Verifying Framework for Vision-Language-Action Model
About
Vision-Language-Action (VLA) models have recently demonstrated strong performance across embodied tasks. Modern VLAs commonly employ diffusion action experts to efficiently generate high-precision continuous action chunks, whereas auto-regressive generation can be slower and less accurate at low-level control. Yet auto-regressive paradigms still provide complementary priors that can improve robustness and generalization in out-of-distribution environments. To leverage both paradigms, we propose Action-Draft-and-Verify (ADV): a diffusion action expert drafts multiple candidate action chunks, and the VLM selects one by scoring all candidates in a single forward pass with a perplexity-style metric. Under matched backbones, training data, and action-chunk length, ADV improves success rate by +4.3 points in simulation and +19.7 points in real-world experiments over a diffusion-based baseline, with the overhead of a single VLM reranking pass.
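The verify step described above can be illustrated with a minimal sketch. The abstract only says the score is "perplexity-style", so the exact formula here is an assumption: standard perplexity (the exponential of the mean negative token log-likelihood), with the lowest-scoring candidate selected. The function names `perplexity` and `select_candidate` and the toy log-probabilities are hypothetical. In a real system the per-token log-probabilities would come from the VLM scoring all tokenized candidate chunks in one batched forward pass.

```python
import math
from typing import List, Sequence, Tuple


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Standard perplexity: exp of the mean negative log-likelihood.

    Assumption: the paper's "perplexity-style metric" may differ in detail.
    """
    if not token_logprobs:
        raise ValueError("candidate has no tokens to score")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def select_candidate(
    candidate_logprobs: Sequence[Sequence[float]],
) -> Tuple[int, List[float]]:
    """Return the index of the lowest-perplexity candidate and all scores."""
    if not candidate_logprobs:
        raise ValueError("need at least one candidate")
    scores = [perplexity(lp) for lp in candidate_logprobs]
    best = min(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Made-up per-token log-probabilities for three drafted action chunks.
# In practice these would be read off the VLM's output distribution.
drafts = [
    [-1.2, -0.9, -1.5],
    [-0.3, -0.4, -0.2],
    [-2.0, -1.8, -2.2],
]
best_index, scores = select_candidate(drafts)
print(f"selected draft {best_index} with perplexity {scores[best_index]:.3f}")
```

Normalizing by token count keeps the score comparable across candidates even if their tokenized lengths differ, which is one reason a perplexity-style score may be preferred over a raw summed log-likelihood.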
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Robot Manipulation | LIBERO (test) | Average Success Rate | 97.8 | 184 |
| Robotic task execution | LIBERO | Average Success Rate | 98.4 | 26 |
| Robotic Manipulation | RoboTwin Easy 2.0 | Adjust Bottle Success Rate | 97 | 11 |
| Robotic task execution | Real-world | Push Blocks Success Rate | 81.7 | 8 |
| Clean table | Real-world (Unseen) | Success Rate | 48 | 6 |
| Hang Cups | Real-world (Unseen) | Success Rate | 56 | 6 |
| Multi-task Robot Manipulation | Real-world (Unseen) | Average Success Rate | 44 | 6 |
| Push Blocks | Real-world (Unseen) | Success Rate | 45 | 6 |
| Pick-&-Place | Real-world (Unseen) | Success Rate | 40 | 6 |
| Robotic task execution | RoboTwin Hard 2.0 | Adjust Bottle Success Rate | 26 | 3 |