Action Draft and Verify: A Self-Verifying Framework for Vision-Language-Action Model

About

Vision-Language-Action (VLA) models have recently demonstrated strong performance across embodied tasks. Modern VLAs commonly employ diffusion action experts to efficiently generate high-precision continuous action chunks, while auto-regressive generation can be slower and less accurate at low-level control. Yet auto-regressive paradigms still provide complementary priors that can improve robustness and generalization in out-of-distribution environments. To leverage both paradigms, we propose Action-Draft-and-Verify (ADV): a diffusion action expert drafts multiple candidate action chunks, and the VLM selects one by scoring all candidates in a single forward pass with a perplexity-style metric. Under matched backbones, training data, and action-chunk length, ADV improves success rate by 4.3 points in simulation and 19.7 points in real-world experiments over a diffusion-based baseline, with the overhead of a single VLM reranking pass.
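The draft-and-verify selection step described above can be illustrated with a minimal sketch. The abstract only says the VLM scores candidates with a "perplexity-style metric", so everything specific here is an assumption: the metric is taken to be standard perplexity (the exponential of the mean negative token log-likelihood), lower is treated as better, and the per-token log-probabilities are assumed to be already available from one batched VLM forward pass over the tokenized candidates. The function names `perplexity` and `select_candidate` and the sample numbers are hypothetical, not taken from the paper.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Standard perplexity: exp of the mean negative log-likelihood.

    Assumption: the paper's "perplexity-style metric" may differ in detail.
    """
    if not token_logprobs:
        raise ValueError("candidate has no tokens to score")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def select_candidate(candidate_logprobs: Sequence[Sequence[float]]) -> int:
    """Return the index of the drafted chunk with the lowest perplexity.

    Each inner sequence holds the log-probabilities the verifier assigned
    to the tokens of one candidate action chunk (hypothetical input format).
    """
    scores = [perplexity(lp) for lp in candidate_logprobs]
    return min(range(len(scores)), key=scores.__getitem__)


# Made-up log-probabilities for three drafted action chunks.
drafts = [
    [-1.2, -0.9, -1.5],
    [-0.3, -0.4, -0.2],
    [-2.0, -1.8, -2.2],
]
best = select_candidate(drafts)  # index of the chunk the verifier prefers
```

In this sketch the diffusion expert and the VLM are left out entirely; only the reranking arithmetic is shown. Normalizing by token count keeps candidates comparable if their tokenized lengths differ, which is one plausible reason to prefer a perplexity-style score over a raw summed log-likelihood.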

Chen Zhao, Zhuoran Wang, Haoyang Li, Shifeng Bao, Guanlin Li, Youhe Feng, Yang Li, Jie Tang, Jing Zhang • 2026

Related benchmarks

Task	Dataset	Metric	Result
Robot Manipulation	LIBERO (test)	Average Success Rate	97.8	220
Robotic task execution	LIBERO	Average Success Rate	98.4	44
Robotic Manipulation	RoboTwin Easy 2.0	Adjust Bottle Success Rate	97	19
Hang Cups	Real-world (Unseen)	Success Rate	56	13
Pick-&-Place	Real-world (Unseen)	Success Rate	40	9
Robotic task execution	Real-world	Push Blocks Success Rate	81.7	8
Clean table	Real-world (Unseen)	Success Rate	48	8
Multi-task Robot Manipulation	Real-world (Unseen)	Average Success Rate	44	6
Push Blocks	Real-world (Unseen)	Success Rate	45	6
Robotic task execution	RoboTwin Hard 2.0	Adjust Bottle Success Rate	26	3

