BiTrajDiff: Bidirectional Trajectory Generation with Diffusion Models for Offline Reinforcement Learning

About

Recent advances in offline Reinforcement Learning (RL) have proven that effective policy learning can benefit from imposing conservative constraints on pre-collected datasets. However, such static datasets often exhibit distribution bias, resulting in limited generalizability. To address this limitation, a straightforward solution is data augmentation (DA), which leverages generative models to enrich the data distribution. Despite the promising results, current DA techniques focus solely on reconstructing future trajectories from given states, while ignoring the exploration of history transitions that reach them. This single-direction paradigm inevitably hinders the discovery of diverse behavior patterns, especially those leading to critical states that may have yielded high-reward outcomes. In this work, we introduce Bidirectional Trajectory Diffusion (BiTrajDiff), a novel DA framework for offline RL that models both future and history trajectories from any intermediate state. Specifically, we decompose the trajectory generation task into two independent yet complementary diffusion processes: one generating forward trajectories to predict future dynamics, and the other generating backward trajectories to trace essential history transitions. BiTrajDiff can efficiently leverage critical states as anchors to expand into potentially valuable yet underexplored regions of the state space, thereby facilitating dataset diversity. Extensive experiments on the D4RL benchmark suite demonstrate that BiTrajDiff achieves superior performance compared to other advanced DA methods across various offline RL backbones.

Yunpeng Qing, Yixiao Chi, Shuo Chen, Shunyu Liu, Kexuan Zhou, Sixu Lin, Litao Liu, Changqing Zou • 2025
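
The bidirectional idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the two generators below are plain random walks standing in for the forward and backward diffusion models, states are scalars, and every name (toy_forward, toy_backward, stitch) is hypothetical. Joining the history and the future around the anchor by simple concatenation is also an assumption made for illustration; actions, rewards, and any filtering of generated trajectories are not shown.

```python
import random


def toy_forward(anchor, horizon, rng):
    """Hypothetical stand-in for a forward generator: states after the anchor."""
    states, s = [], anchor
    for _ in range(horizon):
        s = s + rng.uniform(-1.0, 1.0)  # random-walk step, not a diffusion model
        states.append(s)
    return states


def toy_backward(anchor, horizon, rng):
    """Hypothetical stand-in for a backward generator: states before the anchor."""
    states, s = [], anchor
    for _ in range(horizon):
        s = s + rng.uniform(-1.0, 1.0)  # walk away from the anchor, back in time
        states.append(s)
    states.reverse()  # return the history in chronological order
    return states


def stitch(anchor, horizon, seed=0):
    """Build history + anchor + future as one chronological state sequence."""
    rng = random.Random(seed)
    history = toy_backward(anchor, horizon, rng)
    future = toy_forward(anchor, horizon, rng)
    return history + [anchor] + future


traj = stitch(anchor=0.0, horizon=3)
print(len(traj))  # 3 history states + the anchor + 3 future states
print(traj[3])    # the anchor sits in the middle of the sequence
```

The point of the sketch is only the data layout: a single anchor state yields a trajectory that extends in both time directions, whereas a forward-only generator would produce just the second half.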

Related benchmarks

Task	Dataset	Result
Locomotion	D4RL walker2d-medium-expert	Normalized Score 110.4	90
walker2d locomotion	D4RL walker2d medium-replay	Normalized Score 84.8	78
Offline Reinforcement Learning	D4RL antmaze-umaze (diverse)	Normalized Score 48.8	74
Locomotion	D4RL Halfcheetah medium	Normalized Score 48.7	70
Locomotion	D4RL Walker2d medium	Normalized Score 84.2	70
Locomotion	D4RL HalfCheetah Medium-Replay	--	68
Offline Reinforcement Learning	D4RL antmaze-large (diverse)	Normalized Score 16	47
Offline Reinforcement Learning	D4RL Maze2d-large	Normalized Performance 48.3	39
Locomotion	D4RL hopper-medium-expert	Normalized Score (100k Steps) 110.1	28
Locomotion	D4RL halfcheetah-medium-expert	Test Return 94.1	22

Showing 10 of 20 rows
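
For reading the "Normalized Score" entries: D4RL conventionally rescales a raw episode return so that a random policy maps to 0 and an expert policy maps to 100. The sketch below shows that rescaling with made-up reference returns. The function name and all numbers are illustrative assumptions, and whether every row above (for example those labeled "Normalized Performance" or "Test Return") follows exactly this convention is not confirmed by this page.

```python
def normalized_score(raw_return, random_return, expert_return):
    """Rescale a raw return so that random -> 0 and expert -> 100."""
    return 100.0 * (raw_return - random_return) / (expert_return - random_return)


# Illustrative reference returns only, not real D4RL constants.
print(normalized_score(50.0, 0.0, 100.0))   # halfway between random and expert
print(normalized_score(110.0, 0.0, 100.0))  # a policy that beats the expert reference
```

Because the scale is anchored to a reference expert and not to a hard maximum, values above 100 (such as the 110.4 reported for walker2d-medium-expert) are possible.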
