TGRPO: Fine-tuning Vision-Language-Action Model via Trajectory-wise Group Relative Policy Optimization

About

Vision-Language-Action (VLA) models have demonstrated strong cross-scenario generalization in a variety of robotic tasks through large-scale pre-training and task-specific fine-tuning. However, their training paradigm relies mainly on manually collected successful demonstrations, which makes it difficult for them to adapt to complex environments when they encounter out-of-distribution (OOD) scenarios or execution biases. Reinforcement Learning (RL) provides a closed-loop optimization framework through an active trial-and-error mechanism, but it suffers from sparse rewards, high variance, and unstable optimization in long-horizon robotic tasks. To address these limitations, we propose Trajectory-based Group Relative Policy Optimization (TGRPO), an online RL-based training framework for VLA models. TGRPO uses task analysis generated by a large language model to automatically construct dense reward functions, providing fine-grained feedback that accelerates convergence and improves credit assignment. The core of our method is a group-based strategy that samples multiple trajectories in parallel and normalizes them against one another, reducing variance through relative comparison. By integrating trajectory-level and step-level advantage estimation, TGRPO captures both global and local optimization signals without relying on a value network. Experiments on four task categories of the LIBERO benchmark demonstrate that TGRPO achieves an average success rate of 80.7%, which is 4.2% higher than that of Supervised Fine-Tuning (SFT) and outperforms other representative RL-based post-training methods.
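To make the group-based idea in the abstract more concrete, here is a minimal Python sketch of how trajectory-level and step-level advantages might be computed without a value network. It is an illustration under stated assumptions, not the authors' implementation. The abstract does not say how the two signals are combined, so the weighted sum, the weights `w_step` and `w_traj`, the `eps` stabilizer, and the function names are all hypothetical. The rewards are assumed to be the dense per-step rewards of trajectories sampled for the same task.

```python
from statistics import mean, pstdev


def _normalize(values, eps=1e-8):
    """Standardize values within a group: (v - mean) / (std + eps)."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def group_relative_advantages(group_rewards, w_step=1.0, w_traj=1.0):
    """Per-step advantages for a non-empty group of trajectories.

    group_rewards: list of trajectories, each a list of per-step rewards.
    w_step, w_traj: hypothetical mixing weights (assumed, not from the source).
    """
    # Trajectory-level signal: compare total returns across the group.
    returns = [sum(traj) for traj in group_rewards]
    traj_adv = _normalize(returns)

    # Step-level signal: at each time step, compare the rewards of all
    # trajectories that are still running at that step.
    horizon = max(len(traj) for traj in group_rewards)
    step_adv = [[0.0] * len(traj) for traj in group_rewards]
    for t in range(horizon):
        alive = [i for i, traj in enumerate(group_rewards) if len(traj) > t]
        normed = _normalize([group_rewards[i][t] for i in alive])
        for i, a in zip(alive, normed):
            step_adv[i][t] = a

    # Assumed combination rule: weighted sum of local and global signals.
    return [
        [w_step * s + w_traj * traj_adv[i] for s in step_adv[i]]
        for i in range(len(group_rewards))
    ]


if __name__ == "__main__":
    # Two toy trajectories: only the first earns a reward, at its second step.
    print(group_relative_advantages([[0.0, 1.0], [0.0, 0.0]]))
```

In this toy group the first trajectory receives positive advantages at every step and the second receives negative ones, because each is scored only relative to the other members of its group. No learned baseline is involved, which is the property the abstract attributes to the group-based strategy.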

Zengjue Chen, Runliang Niu, He Kong, Qi Wang, Qianli Xing, Zipei Fan • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Robot Manipulation | LIBERO | Object Achievement | 92.2 | 957 |
| Robot Manipulation | LIBERO Object | Success Rate | 89 | 127 |
| Robot Manipulation | LIBERO | Spatial Success Rate | 90 | 116 |
| Robotic Manipulation | LIBERO Long | Success Rate | 59.2 | 91 |
| Robotic Manipulation | LIBERO Goal | Success Rate | 79 | 42 |
| Robotic Manipulation | LIBERO Average across suites | Success Rate (SR) | 76 | 29 |
| Robotic Manipulation | LIBERO Spatial | Success Rate (SR) | 84 | 28 |
| Multi-task Learning | LIBERO | Object Score | 92.2 | 18 |