Flow Matching Policy Gradients

About

Flow-based generative models, including diffusion models, excel at modeling continuous distributions in high-dimensional spaces. In this work, we introduce Flow Policy Optimization (FPO), a simple on-policy reinforcement learning algorithm that brings flow matching into the policy gradient framework. FPO casts policy optimization as maximizing an advantage-weighted ratio computed from the conditional flow matching loss, in a manner compatible with the popular PPO-clip framework. It sidesteps the need for exact likelihood computation while preserving the generative capabilities of flow-based models. Unlike prior approaches for diffusion-based reinforcement learning that bind training to a specific sampling method, FPO is agnostic to the choice of diffusion or flow integration at both training and inference time. We show that FPO can train diffusion-style policies from scratch in a variety of continuous control tasks. We find that flow-based models can capture multimodal action distributions and achieve higher performance than Gaussian policies, particularly in under-conditioned settings.
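To make the objective concrete, here is a minimal sketch of how an FPO-style clipped surrogate could look, based only on the description above. It assumes hypothetical velocity networks `v_net` (current policy) and `v_net_old` (snapshot before the update), a linear interpolation path for conditional flow matching, and a single Monte Carlo draw of `(t, noise)` per action; the names, the single-draw ratio estimator, and the Euler sampler are illustrative assumptions, not the authors' reference implementation.

```python
import torch

def cfm_loss_per_sample(v_net, obs, action, t, noise):
    """Per-sample conditional flow matching loss for a (state, action) pair.

    Assumes the linear interpolation path x_t = (1 - t) * noise + t * action,
    whose target velocity is (action - noise)."""
    x_t = (1.0 - t) * noise + t * action
    target_v = action - noise
    pred_v = v_net(obs, x_t, t)
    return ((pred_v - target_v) ** 2).mean(dim=-1)  # shape: (batch,)

def fpo_clip_loss(v_net, v_net_old, obs, action, advantage, t, noise,
                  clip_eps=0.2):
    """PPO-clip surrogate with the likelihood ratio replaced by a ratio
    derived from conditional flow matching losses (one plausible form).

    The same (t, noise) draws are reused for both networks; the paper's
    estimator may average several draws per action."""
    loss_new = cfm_loss_per_sample(v_net, obs, action, t, noise)
    with torch.no_grad():
        loss_old = cfm_loss_per_sample(v_net_old, obs, action, t, noise)
    # A lower CFM loss than the old policy plays the role of pi/pi_old > 1.
    ratio = torch.exp(loss_old - loss_new)
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    return -torch.min(unclipped, clipped).mean()

@torch.no_grad()
def sample_action(v_net, obs, action_dim, steps=10):
    """Draw an action by Euler-integrating the learned flow from noise.

    Any ODE/SDE integrator could be substituted here, consistent with the
    claim that FPO is agnostic to the choice of integration scheme."""
    x = torch.randn(obs.shape[0], action_dim)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((obs.shape[0], 1), i * dt)
        x = x + dt * v_net(obs, x, t)
    return x
```

Under these assumptions, the flow matching loss difference stands in for the intractable log-likelihood ratio, which is what lets the PPO-clip machinery apply without exact density evaluation.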

David McAllister, Songwei Ge, Brent Yi, Chung Min Kim, Ethan Weber, Hongsuk Choi, Haiwen Feng, Angjoo Kanazawa • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Text-to-Image Generation | GenEval | GenEval Score | 0.87 | 88 |
| Text-to-Image Generation | ta | TA Score | 0.8159 | 14 |
| Robotic Manipulation | FrankaKitchen N=1 | Task Accomplishment | 18 | 8 |
| Robotic Manipulation | FrankaKitchen N=2 | Accomplished Tasks | 0.5 | 8 |
| Robotic Manipulation | FrankaKitchen N=4 | Accomplished Tasks | 0.32 | 8 |
| Continuous Control | MuJoCo Humanoid v5 | Maximum Average Return | 371.5 | 8 |
| Continuous Control | MuJoCo Hopper v5 | Average Return | 1.22e+3 | 8 |
| Continuous Control | MuJoCo HalfCheetah v5 | Max Return | 2.86e+3 | 8 |
| Continuous Control | MuJoCo Walker2d v5 | Max Average Return | 635.3 | 8 |
| Robotic Manipulation | FrankaKitchen N=7 | Accomplished Tasks | 0.66 | 8 |

Showing 10 of 25 rows.
