GoldenStart: Q-Guided Priors and Entropy Control for Distilling Flow Policies

About

Flow-matching policies hold great promise for reinforcement learning (RL) by capturing complex, multi-modal action distributions. However, their practical application is often hindered by prohibitive inference latency and ineffective online exploration. Although recent works have employed one-step distillation for fast inference, the structure of the initial noise distribution remains an overlooked factor with significant untapped potential. This factor, together with the challenge of controlling policy stochasticity, forms two critical levers for advancing distilled flow-matching policies. To overcome these limitations, we propose GoldenStart (GSFlow), a policy distillation method with Q-guided priors and explicit entropy control. Instead of initializing generation from uninformed noise, we introduce a Q-guided prior modeled by a conditional VAE. This state-conditioned prior repositions the starting points of the one-step generation process into high-Q regions, effectively providing a "golden start" that shortcuts the policy to promising actions. Furthermore, for effective online exploration, we enable our distilled actor to output a stochastic distribution instead of a deterministic point. This is governed by entropy regularization, allowing the policy to shift from pure exploitation to principled exploration. Our integrated framework demonstrates that by designing the generative starting point and explicitly controlling policy entropy, it is possible to achieve efficient and exploratory policies, bridging generative models and practical actor-critic methods. We conduct extensive experiments on offline and online continuous control benchmarks, where our method significantly outperforms prior state-of-the-art approaches. Code will be available at https://github.com/ZhHe11/GSFlow-RL.
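To make the two mechanisms in the abstract concrete, below is a minimal PyTorch sketch of how a state-conditioned prior and an entropy-regularized one-step actor could fit together. All names here (`ConditionalVAEPrior`, `OneStepStochasticActor`, `actor_loss`, the `critic` callable, the coefficient `alpha`) are illustrative assumptions, not the paper's actual architecture, and the procedure for training the prior toward high-Q regions is omitted.

```python
# Hedged sketch of the GSFlow-style actor described in the abstract.
# Shapes, layer sizes, and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn


class ConditionalVAEPrior(nn.Module):
    """State-conditioned prior over starting noise z0 (the "golden start").
    In the paper this prior is a conditional VAE steered toward high-Q
    regions; the training objective is omitted in this sketch."""

    def __init__(self, state_dim, latent_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * latent_dim),  # mean and log-variance
        )

    def forward(self, state):
        mu, log_var = self.net(state).chunk(2, dim=-1)
        std = torch.exp(0.5 * log_var)
        return mu + std * torch.randn_like(std)  # reparameterized sample


class OneStepStochasticActor(nn.Module):
    """Distilled one-step generator. Instead of a deterministic point it
    outputs a Gaussian over actions, so entropy can be regularized."""

    def __init__(self, state_dim, latent_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * action_dim),
        )

    def forward(self, state, z0):
        mu, log_std = self.net(torch.cat([state, z0], dim=-1)).chunk(2, dim=-1)
        dist = torch.distributions.Normal(mu, log_std.clamp(-5.0, 2.0).exp())
        action = torch.tanh(dist.rsample())  # squashed for bounded control
        return action, dist


def actor_loss(prior, actor, critic, states, alpha=0.2):
    """One actor update: maximize Q at the generated action plus an
    entropy bonus that shifts the policy toward exploration. The
    pre-squash Gaussian entropy is used as a tractable surrogate."""
    z0 = prior(states)                    # Q-guided starting point
    actions, dist = actor(states, z0)     # single generation step
    q_values = critic(states, actions)
    entropy = dist.entropy().sum(dim=-1)  # sum over action dimensions
    return -(q_values + alpha * entropy).mean()
```

Conditioning z0 on the state lets the single generation step start near promising action modes instead of uninformed noise, while the Gaussian head keeps an explicit, differentiable entropy term that `alpha` can scale between exploitation and exploration.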

He Zhang, Ying Sun, Hui Xiong • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Offline Reinforcement Learning | D4RL AntMaze | AntMaze Umaze Return | 99.6 | 65 |
| Offline Reinforcement Learning | OGBench | AntMaze Large Navigate | 88.4 | 27 |
| Offline Reinforcement Learning | OGBench Visual | Visual Cube Single Task 1 Success Rate | 92.7 | 11 |
| Robotic Manipulation | OGBench scene-play | Success Rate (Offline) | 88 | 9 |
| Navigation | OGBench humanoidmaze-medium-navigate | Success Rate (Offline) | 5 | 9 |
| Manipulation | OGBench Cube Double Play Offline → Online | Success Rate (Offline) | 51 | 3 |
| Maze Navigation | D4RL AntMaze U-Maze Offline → Online | Success Rate (Offline) | 100 | 3 |
| Maze Navigation | D4RL AntMaze U-Maze Diverse Offline → Online | Success Rate (Offline) | 93 | 3 |
| Maze Navigation | D4RL AntMaze Medium Play Offline → Online | Success Rate (Offline) | 77 | 3 |
| Maze Navigation | D4RL AntMaze Medium Diverse Offline → Online | Success Rate (Offline) | 76 | 3 |

Showing 10 of 14 rows.
