Flow Q-Learning
About
We present flow Q-learning (FQL), a simple and performant offline reinforcement learning (RL) method that leverages an expressive flow-matching policy to model arbitrarily complex action distributions in the data. Training a flow policy with RL is tricky due to the iterative nature of the action generation process. We address this challenge by training an expressive one-step policy with RL, rather than directly guiding an iterative flow policy to maximize values. This way, we completely avoid unstable recursive backpropagation and eliminate costly iterative action generation at test time, while still largely preserving expressivity. We experimentally show that FQL leads to strong performance across 73 challenging state- and pixel-based OGBench and D4RL tasks in offline RL and offline-to-online RL. Project page: https://seohong.me/projects/fql/
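The two training objectives described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the velocity field `v` and the one-step policy are assumed to be arbitrary callables (in practice neural networks), and the loss weighting `alpha` is a hypothetical hyperparameter name.

```python
import numpy as np

def flow_matching_loss(v, states, actions, rng):
    """Behavioral-cloning loss for a velocity field v(s, a_t, t).

    a_t interpolates linearly between Gaussian noise (t=0) and the dataset
    action (t=1); the regression target is the constant velocity a1 - a0.
    """
    a0 = rng.standard_normal(actions.shape)       # noise endpoint
    t = rng.uniform(size=(actions.shape[0], 1))   # random time in [0, 1]
    a_t = (1 - t) * a0 + t * actions              # linear interpolation
    target = actions - a0                         # ODE velocity target
    pred = v(states, a_t, t)
    return np.mean((pred - target) ** 2)

def euler_sample(v, states, action_dim, n_steps, rng):
    """Generate actions by integrating the learned ODE with n_steps Euler steps
    (the costly iterative procedure FQL avoids at test time)."""
    a = rng.standard_normal((states.shape[0], action_dim))
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = np.full((states.shape[0], 1), k * dt)
        a = a + dt * v(states, a, t)
    return a

def one_step_policy_loss(one_step_actions, flow_actions, q_values, alpha):
    """One-step policy objective: distill the iterative flow policy's samples
    while maximizing Q; alpha (assumed name) trades off the two terms.
    No gradient flows through the Euler loop, so recursive backprop is avoided.
    """
    distill = np.mean((one_step_actions - flow_actions) ** 2)
    return alpha * distill - np.mean(q_values)
```

A trivially perfect one-step policy (identical actions to the flow samples) reduces the loss to the negated mean Q-value, which is the value-maximization term alone.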
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| hopper locomotion | D4RL hopper medium-replay | Normalized Score | 85.4 | 56 |
| Offline Reinforcement Learning | OGBench antmaze-large-navigate-singletask task1-v0 to task5-v0 | Score | 93 | 55 |
| walker2d locomotion | D4RL walker2d medium-replay | Normalized Score | 82.1 | 53 |
| Locomotion | D4RL walker2d-medium-expert | Normalized Score | 100.5 | 47 |
| Locomotion | D4RL Walker2d medium | Normalized Score | 72.7 | 44 |
| Offline Reinforcement Learning | D4RL antmaze-umaze (diverse) | Normalized Score | 89 | 40 |
| Offline Reinforcement Learning | D4RL MuJoCo Hopper medium standard | Normalized Score | 68.1 | 36 |
| Locomotion | D4RL HalfCheetah Medium-Replay | Normalized Score | 0.511 | 33 |
| Offline Reinforcement Learning | D4RL Adroit pen (cloned) | Normalized Return | 74 | 32 |
| Offline Reinforcement Learning | D4RL Adroit pen (human) | Normalized Return | 53 | 32 |