floq: Training Critics via Flow-Matching for Scaling Compute in Value-Based RL

About

A hallmark of modern large-scale machine learning techniques is the use of training objectives that provide dense supervision to intermediate computations, such as teacher forcing the next token in language models or denoising step-by-step in diffusion models. This enables models to learn complex functions in a generalizable manner. Motivated by this observation, we investigate the benefits of iterative computation for temporal difference (TD) methods in reinforcement learning (RL). TD methods typically represent value functions in a monolithic fashion, without iterative compute. We introduce floq (flow-matching Q-functions), an approach that parameterizes the Q-function using a velocity field and trains it using techniques from flow-matching, which are typically used in generative modeling. The velocity field underlying the flow is trained using a TD-learning objective, which bootstraps from values produced by a target velocity field; these values are computed by running multiple steps of numerical integration. Crucially, by appropriately setting the number of integration steps, floq allows for more fine-grained control and scaling of the Q-function capacity than monolithic architectures. Across a suite of challenging offline RL benchmarks and online fine-tuning tasks, floq improves performance by nearly 1.8x. floq scales capacity far better than standard TD-learning architectures, highlighting the potential of iterative computation for value learning.
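To make the idea in the abstract concrete, here is a minimal sketch of a Q-value read off as the endpoint of a numerically integrated velocity field, together with a TD-style flow-matching loss. It is not the authors' implementation. The abstract does not specify the integrator, the interpolation path, or how the starting point is chosen, so forward Euler integration, a linear path, a scalar state `z`, and the names `integrate_q`, `td_flow_matching_loss`, and `z0` are all assumptions made for illustration.

```python
import numpy as np


def integrate_q(velocity, state, action, z0, num_steps):
    """Integrate z from t=0 to t=1 with forward Euler (an assumed integrator).

    The endpoint z(1) is read as the Q-value estimate for (state, action).
    `velocity(z, t, state, action)` is a hypothetical callable standing in
    for a learned velocity network.
    """
    z = z0
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        z = z + dt * velocity(z, t, state, action)
    return z


def td_flow_matching_loss(velocity, target_velocity, batch, z0, t, gamma, num_steps):
    """Flow-matching regression toward a bootstrapped TD target (a sketch).

    Assumes a linear path from z0 to the TD target y, so the regression
    target for the velocity at time t is the constant (y - z0).
    """
    s, a, r, s_next, a_next = batch
    # TD target: reward plus the discounted value obtained by integrating
    # the *target* velocity field at the next state-action pair.
    y = r + gamma * integrate_q(target_velocity, s_next, a_next, z0, num_steps)
    # Point on the assumed linear path between z0 and y at time t.
    z_t = (1.0 - t) * z0 + t * y
    pred = velocity(z_t, t, s, a)
    return np.mean((pred - (y - z0)) ** 2)


# Toy usage with a hand-written constant velocity field (purely illustrative).
def toy_velocity(z, t, s, a):
    return s + a


q_value = integrate_q(toy_velocity, 1.0, 2.0, 0.0, num_steps=4)

batch = (
    np.array([1.0]),  # s
    np.array([2.0]),  # a
    np.array([0.5]),  # r
    np.array([0.0]),  # s_next
    np.array([1.0]),  # a_next
)
loss = td_flow_matching_loss(
    toy_velocity, toy_velocity, batch,
    z0=np.zeros(1), t=np.array([0.5]), gamma=0.5, num_steps=4,
)
```

In this sketch, `num_steps` is the knob the abstract refers to: increasing it spends more sequential compute on each Q-value evaluation without changing the number of parameters in the velocity field.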

Bhavya Agrawalla, Michal Nauman, Khush Agrawal, Aviral Kumar • 2025

Related benchmarks

Task                            Dataset                             Metric                     Result  Rank
Offline Reinforcement Learning  scene-play OGBench 5 tasks v0       Average Success Rate       58      33
Offline Reinforcement Learning  OGBench puzzle-4x4                  Success Rate               28      26
Offline Reinforcement Learning  OGBench cube-triple (ct)            Success Rate               4       25
Offline Reinforcement Learning  OGBench puzzle-3x3                  Average Task Success Rate  37      9
Offline Reinforcement Learning  OGBench scene                       Average Task Success       57      9
Offline Reinforcement Learning  OGBench cube-double                 Average Task Success       47      9
Offline Reinforcement Learning  OGBench puzzle-4x4-play (5 tasks)   Success Rate               28      7
Offline Reinforcement Learning  OGBench cube-double-play (5 tasks)  Success Rate               47      7
Offline Reinforcement Learning  OGBench antmaze-giant               Average Task Success       51      6
Offline Reinforcement Learning  OGBench hmmaze-large                Average Task Success       28      6
Showing 10 of 11 rows
