
Flow-Based Single-Step Completion for Efficient and Expressive Policy Learning

About

Generative models such as diffusion and flow-matching offer expressive policies for offline reinforcement learning (RL) by capturing rich, multimodal action distributions, but their iterative sampling introduces high inference costs and training instability due to gradient propagation across sampling steps. We propose the Single-Step Completion Policy (SSCP), a generative policy trained with an augmented flow-matching objective to predict direct completion vectors from intermediate flow samples, enabling accurate, one-shot action generation. In an off-policy actor-critic framework, SSCP combines the expressiveness of generative models with the training and inference efficiency of unimodal policies, without requiring long backpropagation chains. Our method scales effectively to offline, offline-to-online, and online RL settings, offering substantial gains in speed and adaptability over diffusion-based baselines. We further extend SSCP to goal-conditioned RL, enabling flat policies to exploit subgoal structures without explicit hierarchical inference. SSCP achieves strong results across standard offline RL and behavior cloning benchmarks, positioning it as a versatile, expressive, and efficient framework for deep RL and sequential decision-making. The code is available at https://github.com/PrajwalKoirala/SSCP-Single-Step-Completion-Policy.
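The idea of predicting a completion vector from an intermediate flow sample can be illustrated with a small sketch. It assumes a straight-line interpolation path between a noise sample and a dataset action, which the abstract does not specify, and it computes only the regression targets, not a trained network. All function names are hypothetical and are not taken from the authors' repository.

```python
import random

# Assumption: the flow follows a straight line from noise (t=0) to the action (t=1).
def interpolate(noise, action, t):
    """Intermediate flow sample x_t on the assumed linear path."""
    return [(1.0 - t) * n + t * a for n, a in zip(noise, action)]

def velocity_target(noise, action):
    """Standard flow-matching target on a linear path: constant velocity."""
    return [a - n for n, a in zip(noise, action)]

def completion_target(x_t, action):
    """Vector that carries x_t directly to the action in a single step."""
    return [a - x for x, a in zip(x_t, action)]

def one_shot_action(x_t, completion):
    """Single-step generation: add the (predicted) completion vector to x_t."""
    return [x + c for x, c in zip(x_t, completion)]

random.seed(0)
action = [0.5, -0.2, 0.9]                          # illustrative dataset action
noise = [random.gauss(0.0, 1.0) for _ in action]   # sample from the base distribution
t = 0.3                                            # intermediate flow time

x_t = interpolate(noise, action, t)
velocity = velocity_target(noise, action)
completion = completion_target(x_t, action)
recovered = one_shot_action(x_t, completion)
```

Under this assumption the completion target equals `(1 - t)` times the velocity target, so a network that regresses it well can produce an action from any intermediate sample without iterating over sampling steps.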

Prajwal Koirala, Cody Fleming • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Offline Reinforcement Learning | D4RL Gym walker2d (medium-replay) | Normalized Return | 85.9 | 73 |
| Offline Reinforcement Learning | D4RL Gym halfcheetah-medium | Normalized Return | 52.3 | 65 |
| Offline Reinforcement Learning | D4RL Gym walker2d medium | Normalized Return | 84.2 | 63 |
| Offline Reinforcement Learning | D4RL Gym hopper (medium-replay) | Normalized Return | 101.4 | 49 |
| Offline Reinforcement Learning | D4RL Gym halfcheetah-medium-replay | Normalized Average Return | 44.4 | 48 |
| Offline Reinforcement Learning | D4RL Gym hopper-medium | Normalized Return | 102.4 | 46 |
| Offline Reinforcement Learning | D4RL Gym walker2d medium-expert | Normalized Average Return | 111.1 | 43 |
| Offline Reinforcement Learning | D4RL Gym hopper-medium-expert | Normalized Avg Return | 110.9 | 41 |
| Offline Reinforcement Learning | D4RL Gym halfcheetah-medium-expert | Normalized Return | 98.1 | 40 |
