$\pi_\texttt{RL}$: Online RL Fine-tuning for Flow-based Vision-Language-Action Models
About
Vision-Language-Action (VLA) models enable robots to understand and perform complex tasks from multimodal input. Although recent work explores using reinforcement learning (RL) to automate the laborious data collection required for scaling supervised fine-tuning (SFT), applying RL to large-scale flow-based VLAs (e.g., $\pi_0$, $\pi_{0.5}$) remains challenging due to the intractable action log-likelihoods arising from flow matching. We address this challenge with $\pi_{\texttt{RL}}$, which features two technical approaches: (1) **Flow-Noise** models the denoising process as a discrete-time MDP with a learnable noise network for exact log-likelihood computation. (2) **Flow-SDE** integrates denoising with agent-environment interaction, formulating a two-layer MDP that employs an ODE-to-SDE conversion for efficient RL exploration. We evaluate $\pi_{\texttt{RL}}$ across various benchmarks; experiments demonstrate that RL yields significant performance improvements in both in-distribution and out-of-distribution settings.
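The Flow-Noise idea above can be sketched in a few lines: if each Euler denoising step injects Gaussian noise with a learnable scale, the step becomes a Gaussian policy whose log-likelihood is exact. The sketch below is a minimal illustration of that principle, not the paper's implementation; `velocity_net` and `sigmas` are hypothetical stand-ins for the flow network and the learnable noise network.

```python
import torch

def denoise_with_logprob(x0, velocity_net, sigmas, num_steps=10):
    """Run one denoising rollout viewed as a discrete-time MDP.

    Each Euler step x_{k+1} = x_k + v(x_k, t_k) * dt + sigma_k * eps is a
    Gaussian transition, so the exact rollout log-likelihood is just the
    sum of per-step Gaussian log-densities (a DDPO-style factorization).
    """
    dt = 1.0 / num_steps
    x = x0
    logprob = torch.zeros(x.shape[0])
    for k in range(num_steps):
        t = torch.full((x.shape[0],), k * dt)
        mean = x + velocity_net(x, t) * dt        # deterministic flow step
        std = sigmas[k].clamp_min(1e-6)           # learnable noise scale
        x_next = mean + std * torch.randn_like(mean)  # stochastic action
        dist = torch.distributions.Normal(mean, std)
        logprob = logprob + dist.log_prob(x_next).flatten(1).sum(-1)
        x = x_next
    return x, logprob
```

Because `logprob` is exact (not an ELBO), it can be plugged directly into standard policy-gradient objectives such as PPO ratios.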
Related benchmarks
| Task | Dataset | Metric | Result (%) | Rank |
|---|---|---|---|---|
| Robot Manipulation | SimplerEnv WidowX Robot tasks | Average Success Rate | 79.6 | 26 |
| Put Carrot on Plate | SimplerEnv WidowX | Success Rate | 97.3 | 18 |
| Put Spoon on Towel | SimplerEnv WidowX | Success Rate | 82.7 | 18 |
| Stack Green on Yellow | SimplerEnv WidowX | Success Rate | 83.3 | 18 |
| Put Eggplant in Basket | SimplerEnv WidowX | Success Rate | 55.0 | 18 |
| Robotic Manipulation | ManiSkill3 | Stack Cube Success Rate | 72.3 | 15 |