
Offline Reinforcement Learning with Generative Trajectory Policies

About

Generative models have emerged as a powerful class of policies for offline reinforcement learning (RL) due to their ability to capture complex, multi-modal behaviors. However, existing methods face a stark trade-off: slow, iterative models like diffusion policies are computationally expensive, while fast, single-step models like consistency policies often suffer from degraded performance. In this paper, we demonstrate that it is possible to bridge this gap. The key to moving beyond the limitations of individual methods, we argue, lies in a unifying perspective that views modern generative models, including diffusion, flow matching, and consistency models, as specific instances of learning a continuous-time generative trajectory governed by an ordinary differential equation (ODE). This principled foundation provides a clearer design space for generative policies in RL and allows us to propose Generative Trajectory Policies (GTPs), a new and more general policy paradigm that learns the entire solution map of the underlying ODE. To make this paradigm practical for offline RL, we further introduce two key theoretically principled adaptations. Empirical results demonstrate that GTP achieves state-of-the-art performance on D4RL benchmarks: it significantly outperforms prior generative policies and achieves perfect scores on several notoriously hard AntMaze tasks.
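
To make the contrast between iterative sampling and a solution map concrete, here is a minimal toy sketch. It is not the authors' method: the velocity field `velocity`, the helper names, and the linear ODE dx/dt = -x are all assumptions chosen because that ODE has a closed-form solution. In a real policy, both the velocity field and the solution map would be learned networks conditioned on the state.

```python
import math

def velocity(x, t):
    # Toy velocity field dx/dt = -x (hypothetical stand-in for a learned network).
    return -x

def euler_integrate(x, t_start, t_end, num_steps):
    # Iterative sampling: many small solver steps, as in diffusion-style policies.
    dt = (t_end - t_start) / num_steps
    t = t_start
    for _ in range(num_steps):
        x = x + dt * velocity(x, t)
        t += dt
    return x

def solution_map(x, t_start, t_end):
    # Solution map: jumps from time t_start to t_end in one evaluation.
    # Closed form for this toy ODE only; a trajectory policy would learn it.
    return x * math.exp(-(t_end - t_start))

x0 = 2.0
multi_step = euler_integrate(x0, 0.0, 1.0, num_steps=1000)
one_shot = solution_map(x0, 0.0, 1.0)
two_hops = solution_map(solution_map(x0, 0.0, 0.5), 0.5, 1.0)
```

Here `multi_step` and `one_shot` agree up to the Euler discretization error, and `two_hops` matches `one_shot` because composing the map over adjacent time intervals is equivalent to a single jump. A map defined for arbitrary start and end times therefore allows both one-step and multi-step sampling.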

Xinsong Feng, Leshu Tang, Chenan Wang, Haipeng Chen • 2025

Related benchmarks

Task                            Dataset                             Metric                     Result  Rank
Offline Reinforcement Learning  D4RL antmaze-umaze (diverse)        Normalized Score           81.9    74
Offline Reinforcement Learning  D4RL Gym walker2d (medium-replay)   Normalized Return          94.2    73
Offline Reinforcement Learning  D4RL Gym halfcheetah-medium         Normalized Return          53.9    65
Offline Reinforcement Learning  D4RL Gym walker2d medium            Normalized Return          89.5    63
Offline Reinforcement Learning  D4RL Gym hopper (medium-replay)     Normalized Return          101.7   49
Offline Reinforcement Learning  D4RL Gym halfcheetah-medium-replay  Normalized Average Return  50.8    48
Offline Reinforcement Learning  D4RL Gym hopper-medium              Normalized Return          90.3    46
Offline Reinforcement Learning  D4RL Gym walker2d medium-expert     Normalized Average Return  114.2   43
Offline Reinforcement Learning  D4RL Gym hopper-medium-expert       Normalized Avg Return      112.2   41
Offline Reinforcement Learning  D4RL Gym halfcheetah-medium-expert  Normalized Return          93.8    40
Showing 10 of 23 rows
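
For readers unfamiliar with the metric, D4RL normalized scores rescale a raw episode return so that a random policy maps to 0 and an expert policy maps to 100. The sketch below shows that arithmetic; the function name and the reference returns in the usage lines are hypothetical placeholders, not values from the benchmark or from this paper, and it is an assumption that the results listed above follow this convention.

```python
def d4rl_normalized_score(raw_return, random_return, expert_return):
    # 0 corresponds to the random-policy reference, 100 to the expert reference.
    return 100.0 * (raw_return - random_return) / (expert_return - random_return)

# Hypothetical reference returns, chosen only to illustrate the arithmetic.
score = d4rl_normalized_score(raw_return=900.0, random_return=0.0, expert_return=1000.0)
above_expert = d4rl_normalized_score(raw_return=1100.0, random_return=0.0, expert_return=1000.0)
```

Under this convention a score above 100 means the policy's return exceeds the expert reference on that dataset.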
