Action Draft and Verify: A Self-Verifying Framework for Vision-Language-Action Model

About

Vision-Language-Action (VLA) models have recently demonstrated strong performance across embodied tasks. Modern VLAs commonly employ diffusion action experts to efficiently generate high-precision continuous action chunks, while auto-regressive generation can be slower and less accurate at low-level control. Yet auto-regressive paradigms still provide complementary priors that can improve robustness and generalization in out-of-distribution environments. To leverage both paradigms, we propose Action-Draft-and-Verify (ADV): a diffusion action expert drafts multiple candidate action chunks, and the VLM selects one by scoring all candidates in a single forward pass with a perplexity-style metric. Under matched backbones, training data, and action-chunk length, ADV improves success rate by +4.3 points in simulation and +19.7 points in the real world over a diffusion-based baseline, at the cost of a single-pass VLM reranking overhead.

Chen Zhao, Zhuoran Wang, Haoyang Li, Shifeng Bao, Guanlin Li, Youhe Feng, Yang Li, Jie Tang, Jing Zhang • 2026
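
The verify step described in the abstract can be illustrated with a minimal sketch. It assumes that the VLM has already produced per-token log-probabilities for each drafted action chunk (for example from one batched forward pass), and that the "perplexity-style metric" is the standard perplexity, i.e. the exponential of the negative mean token log-probability. The abstract does not specify the exact metric, so this formula, the function names `perplexity` and `select_candidate`, and the example numbers are all hypothetical and not the authors' implementation.

```python
import math


def perplexity(token_logprobs):
    """Standard perplexity: exp of the negative mean token log-probability.

    Assumption: this stands in for the paper's "perplexity-style metric",
    whose exact definition is not given in the abstract.
    """
    if not token_logprobs:
        raise ValueError("candidate has no tokens to score")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def select_candidate(candidate_logprobs):
    """Score every drafted candidate and return (best_index, scores).

    The candidate with the lowest perplexity is treated as the one the
    verifier finds most plausible.
    """
    scores = [perplexity(lp) for lp in candidate_logprobs]
    best_index = min(range(len(scores)), key=scores.__getitem__)
    return best_index, scores


# Hypothetical log-probabilities for three drafted action chunks,
# four action tokens each (illustrative values, not real model output).
candidate_logprobs = [
    [-1.2, -0.9, -1.5, -1.1],
    [-0.3, -0.4, -0.2, -0.5],
    [-2.0, -1.8, -2.2, -1.9],
]

best_index, scores = select_candidate(candidate_logprobs)
print("selected candidate:", best_index)
```

Averaging over tokens before exponentiating keeps the score comparable across candidates even if their tokenized lengths differ, which is one reason a perplexity-style score is a natural choice for reranking.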

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Robot Manipulation | LIBERO (test) | Average Success Rate: 97.8 | 184 |
| Robotic task execution | LIBERO | Average Success Rate: 98.4 | 26 |
| Robotic Manipulation | RoboTwin Easy 2.0 | Adjust Bottle Success Rate: 97 | 11 |
| Robotic task execution | Real-world | Push Blocks Success Rate: 81.7 | 8 |
| Clean table | Real-world (Unseen) | Success Rate: 48 | 6 |
| Hang Cups | Real-world (Unseen) | Success Rate: 56 | 6 |
| Multi-task Robot Manipulation | Real-world (Unseen) | Average Success Rate: 44 | 6 |
| Push Blocks | Real-world (Unseen) | Success Rate: 45 | 6 |
| Pick-&-Place | Real-world (Unseen) | Success Rate: 40 | 6 |
| Robotic task execution | RoboTwin Hard 2.0 | Adjust Bottle Success Rate: 26 | 3 |
