Flow Matching Policy Gradients
About
Flow-based generative models, including diffusion models, excel at modeling continuous distributions in high-dimensional spaces. In this work, we introduce Flow Policy Optimization (FPO), a simple on-policy reinforcement learning algorithm that brings flow matching into the policy gradient framework. FPO casts policy optimization as maximizing an advantage-weighted ratio computed from the conditional flow matching loss, in a manner compatible with the popular PPO-clip framework. It sidesteps the need for exact likelihood computation while preserving the generative capabilities of flow-based models. Unlike prior approaches for diffusion-based reinforcement learning that bind training to a specific sampling method, FPO is agnostic to the choice of diffusion or flow integration at both training and inference time. We show that FPO can train diffusion-style policies from scratch in a variety of continuous control tasks. We find that flow-based models can capture multimodal action distributions and achieve higher performance than Gaussian policies, particularly in under-conditioned settings.
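The core idea above — replacing the PPO likelihood ratio with a ratio derived from the conditional flow matching (CFM) loss — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the linear velocity model `v_theta(x_t, t) = theta * x_t` and the function names are hypothetical stand-ins, and the ratio `exp(L_cfm(old) - L_cfm(new))` is Monte Carlo estimated over shared `(t, eps)` draws before being plugged into the standard PPO-clip surrogate.

```python
import numpy as np

def cfm_loss(theta, a, t, eps):
    """Per-sample conditional flow matching loss for a toy scalar
    velocity model v_theta(x_t, t) = theta * x_t (hypothetical
    stand-in for the policy's velocity network)."""
    x_t = (1.0 - t) * eps + t * a   # linear interpolation path from noise to action
    target = a - eps                # CFM regression target (straight-line velocity)
    pred = theta * x_t
    return (pred - target) ** 2

def fpo_clip_objective(theta_new, theta_old, actions, advantages,
                       eps_clip=0.2, n_mc=8, rng=None):
    """FPO-style surrogate: the exact likelihood ratio is replaced by
    exp(L_cfm(old) - L_cfm(new)), averaged over Monte Carlo draws of
    (t, eps), then used in the usual PPO-clip objective."""
    rng = np.random.default_rng(rng)
    total = 0.0
    for a, adv in zip(actions, advantages):
        t = rng.uniform(size=n_mc)
        eps = rng.standard_normal(n_mc)
        # Same (t, eps) draws for old and new parameters, so the
        # ratio is exactly 1 when the parameters have not moved.
        l_new = cfm_loss(theta_new, a, t, eps).mean()
        l_old = cfm_loss(theta_old, a, t, eps).mean()
        ratio = np.exp(l_old - l_new)
        clipped = np.clip(ratio, 1.0 - eps_clip, 1.0 + eps_clip)
        total += min(ratio * adv, clipped * adv)
    return total / len(actions)
```

Because the ratio is built from per-sample losses rather than exact likelihoods, the same objective applies regardless of which ODE/SDE integrator is later used to sample actions, which is the sampler-agnosticism the abstract emphasizes.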
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Text-to-Image Generation | GenEval | GenEval Score | 0.87 | 88 |
| Text-to-Image Generation | ta | TA Score | 0.8159 | 14 |
| Robotic Manipulation | FrankaKitchen N=1 | Task Accomplishment | 18 | 8 |
| Robotic Manipulation | FrankaKitchen N=2 | Accomplished Tasks | 0.5 | 8 |
| Robotic Manipulation | FrankaKitchen N=4 | Accomplished Tasks | 0.32 | 8 |
| Robotic Manipulation | FrankaKitchen N=7 | Accomplished Tasks | 0.66 | 8 |
| Continuous Control | MuJoCo Humanoid v5 | Maximum Average Return | 371.5 | 8 |
| Continuous Control | MuJoCo Hopper v5 | Average Return | 1.22e+3 | 8 |
| Continuous Control | MuJoCo HalfCheetah v5 | Max Return | 2.86e+3 | 8 |
| Continuous Control | MuJoCo Walker2d v5 | Max Average Return | 635.3 | 8 |