
A Vision-Language-Action-Critic Model for Robotic Real-World Reinforcement Learning

About

Robotic real-world reinforcement learning (RL) with vision-language-action (VLA) models is bottlenecked by sparse, handcrafted rewards and inefficient exploration. We introduce VLAC, a general process reward model built upon InternVL and trained on large-scale heterogeneous datasets. Given pairwise observations and a language goal, it outputs a dense progress delta and a done signal, eliminating task-specific reward engineering, and it supports one-shot in-context transfer to unseen tasks and environments. VLAC is trained on vision-language datasets to strengthen its perception, dialogue, and reasoning capabilities, together with robot and human trajectory data that ground action generation and progress estimation. It is additionally trained on large numbers of constructed negative and semantically mismatched samples so that it rejects irrelevant prompts and detects regression or stagnation. With prompt control, a single VLAC model alternately generates reward and action tokens, unifying critic and policy. We deploy VLAC inside an asynchronous real-world RL loop and layer on a graded human-in-the-loop protocol (offline demonstration replay, return and explore, human-guided explore) that accelerates exploration and stabilizes early learning. Across four distinct real-world manipulation tasks, VLAC lifts success rates from about 30% to about 90% within 200 real-world interaction episodes; incorporating human-in-the-loop interventions yields a further 50% improvement in sample efficiency and achieves up to 100% final success.
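As a rough illustration of how a pairwise critic of this kind could supply dense rewards, the sketch below walks over consecutive observation pairs, collects the progress delta for each pair, and stops at the first done signal. The `Critic` signature, `dense_rewards`, and `toy_critic` are hypothetical names chosen for this example; they are not the VLAC interface, and the toy critic simply treats each observation as a scalar progress value.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical critic signature (an assumption, not the VLAC API):
# (previous observation, current observation, language goal)
#   -> (progress delta, done flag)
Critic = Callable[[object, object, str], Tuple[float, bool]]


def dense_rewards(
    observations: Sequence[object], goal: str, critic: Critic
) -> Tuple[List[float], bool]:
    """Query the critic on each consecutive observation pair.

    Returns the per-step progress deltas (used here as dense rewards)
    and whether the critic reported the task as done.
    """
    rewards: List[float] = []
    done = False
    for prev_obs, next_obs in zip(observations, observations[1:]):
        delta, done = critic(prev_obs, next_obs, goal)
        rewards.append(delta)
        if done:
            break  # stop at the first done signal
    return rewards, done


def toy_critic(prev_obs: float, next_obs: float, goal: str) -> Tuple[float, bool]:
    """Stand-in critic: observations are scalar progress values in [0, 1]."""
    delta = next_obs - prev_obs  # negative on regression, zero on stagnation
    return delta, next_obs >= 1.0


if __name__ == "__main__":
    trajectory = [0.0, 0.25, 0.25, 0.75, 1.0, 1.0]
    rewards, done = dense_rewards(trajectory, "put the bowl on the plate", toy_critic)
    print(rewards, done)
```

In this toy setup a stalled step yields a zero reward and a regressing step yields a negative one, which mirrors the regression and stagnation cases the abstract mentions.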

Shaopeng Zhai, Qi Zhang, Tianyi Zhang, Fuxian Huang, Haoran Zhang, Ming Zhou, Shengzhe Zhang, Litao Liu, Sixu Lin, Jiangmiao Pang • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Pairwise progress-judgment | RoboPulse Large hop range | Accuracy (Real): 79 | 11 |
| Pairwise progress-judgment | RoboPulse Overall | Overall Average Accuracy: 71 | 11 |
| Pairwise progress-judgment | RoboPulse Small hop range | Accuracy (Real): 60 | 11 |
| Pairwise progress-judgment | RoboPulse Medium hop range | Accuracy (Real): 67 | 11 |
| Reward alignment | RBM-EVAL ID | Pearson r (VOC): 0.16 | 8 |
| Reward alignment | RBM-EVAL OOD | Pearson r (VOC): 0.17 | 8 |
| Trajectory Ranking | RBM OOD 1.0 (test) | Kendall's Tau-a: 0.08 | 8 |
| Task Completion Classification | SARM (real-world rollouts) | Average Accuracy: 33.9 | 8 |
| Video Frame Rank-Correlation | DROID | VOC Rank-Correlation (Sparse): 0.66 | 6 |
| Video Frame Rank-Correlation | AGIBOT-World | VOC (Sparse): 0.29 | 6 |
Showing 10 of 16 rows
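
For readers unfamiliar with the Kendall's Tau-a metric listed above, here is a minimal sketch of its standard definition: the number of concordant pairs minus the number of discordant pairs, divided by the total number of pairs, with no correction for ties. How the benchmark applies it to trajectories is not stated on this page, so the function name and the example inputs are illustrative assumptions only.

```python
from itertools import combinations
from typing import Sequence


def kendall_tau_a(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall's tau-a: (concordant - discordant) / (n * (n - 1) / 2).

    Tied pairs count as neither concordant nor discordant, and the
    denominator is not adjusted for them.
    """
    n = len(x)
    if n != len(y):
        raise ValueError("x and y must have the same length")
    if n < 2:
        raise ValueError("need at least two items to form a pair")

    concordant = 0
    discordant = 0
    for i, j in combinations(range(n), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


if __name__ == "__main__":
    # Hypothetical example: predicted scores vs. a reference ordering.
    predicted = [0.1, 0.4, 0.3, 0.9]
    reference = [1, 2, 3, 4]
    print(kendall_tau_a(predicted, reference))
```

A value of 1 means the two orderings agree on every pair, -1 means they disagree on every pair, and values near 0 indicate little ordinal agreement.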
