
SAC Flow: Sample-Efficient Reinforcement Learning of Flow-Based Policies via Velocity-Reparameterized Sequential Modeling

About

Training expressive flow-based policies with off-policy reinforcement learning is notoriously unstable due to gradient pathologies in the multi-step action sampling process. We trace this instability to a fundamental connection: the flow rollout is algebraically equivalent to a residual recurrent computation, making it susceptible to the same vanishing and exploding gradients as RNNs. To address this, we reparameterize the velocity network using principles from modern sequential models, introducing two stable architectures: Flow-G, which incorporates a gated velocity, and Flow-T, which utilizes a decoded velocity. We then develop a practical SAC-based algorithm, enabled by a noise-augmented rollout, that facilitates direct end-to-end training of these policies. Our approach supports both from-scratch and offline-to-online learning and achieves state-of-the-art performance on continuous control and robotic manipulation benchmarks, eliminating the need for common workarounds like policy distillation or surrogate objectives.
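The link between a flow rollout and a residual recurrence can be illustrated with a toy scalar case. The sketch below is not the paper's implementation: the linear velocity `v(a) = w * a + b`, the uniform Euler step, and the function names `rollout` and `rollout_jacobian` are assumptions made for illustration only. It shows that each sampling step has the residual form `a <- a + dt * v(a)`, so the gradient of the final action with respect to the initial noise is a product of per-step factors, the same structure that makes RNN gradients shrink or grow.

```python
def rollout(a0, w, b, num_steps):
    """Euler rollout of a toy scalar flow with velocity v(a) = w * a + b.

    Hypothetical stand-in for a learned velocity network; integrates over
    the unit time interval in `num_steps` uniform steps.
    """
    dt = 1.0 / num_steps
    a = a0
    for _ in range(num_steps):
        # Residual update: new state = old state + step * velocity.
        # This has the same form as one step of a residual recurrent cell.
        a = a + dt * (w * a + b)
    return a


def rollout_jacobian(w, num_steps):
    """Derivative of the final action with respect to the initial sample.

    Each step contributes the factor (1 + dt * dv/da) = (1 + dt * w),
    so the end-to-end gradient is that factor raised to num_steps.
    """
    dt = 1.0 / num_steps
    return (1.0 + dt * w) ** num_steps


def finite_difference(a0, w, b, num_steps, eps=1e-6):
    """Numerical check of rollout_jacobian by perturbing the initial sample."""
    hi = rollout(a0 + eps, w, b, num_steps)
    lo = rollout(a0, w, b, num_steps)
    return (hi - lo) / eps
```

In this toy setting, four steps with `w = 5` give a gradient factor of 2.25^4 (about 25.6), while `w = -3` gives 0.25^4 (about 0.0039). A learned velocity network has a state-dependent Jacobian rather than a constant `w`, but the multiplicative structure across steps is the same.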

Yixian Zhang, Shu'ang Yu, Tonghe Zhang, Mo Guang, Haojia Hui, Kaiwen Long, Yu Wang, Chao Yu, Wenbo Ding • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Continuous Control | MuJoCo Ant v4 | Average Return: 5.85e+3 | 46 |
| Continuous Control | MuJoCo Walker2d v4 | -- | 39 |
| Continuous Control | MuJoCo HalfCheetah v4 | Average Return: 1.49e+4 | 36 |
| Continuous Control | MuJoCo Swimmer v4 | Total Reward: 101.5 | 19 |
| Continuous Control | MuJoCo Humanoid v4 (test) | Mean Episodic Return: 1.00e+4 | 10 |
| Continuous Control | MuJoCo Ant v4 (test) | Mean Episodic Return: 5.31e+3 | 10 |
| Continuous Control | MuJoCo HalfCheetah v4 (test) | Mean Episodic Return: 1.26e+4 | 10 |
| Continuous Control | MuJoCo Hopper v4 (test) | Mean Episodic Return: 3.34e+3 | 6 |
| Continuous Control | MuJoCo InvertedPendulum v4 (test) | Mean Episodic Return: 25 | 6 |
| Continuous Control | MuJoCo Reacher v4 (test) | Mean Episodic Return: -8 | 6 |
