
Score Regularized Policy Optimization through Diffusion Behavior

About

Recent developments in offline reinforcement learning have uncovered the immense potential of diffusion modeling, which excels at representing heterogeneous behavior policies. However, sampling from diffusion policies is considerably slow because it necessitates tens to hundreds of iterative inference steps for one action. To address this issue, we propose to extract an efficient deterministic inference policy from critic models and pretrained diffusion behavior models, leveraging the latter to directly regularize the policy gradient with the behavior distribution's score function during optimization. Our method enjoys powerful generative capabilities of diffusion modeling while completely circumventing the computationally intensive and time-consuming diffusion sampling scheme, both during training and evaluation. Extensive results on D4RL tasks show that our method boosts action sampling speed by more than 25 times compared with various leading diffusion-based methods in locomotion tasks, while still maintaining state-of-the-art performance.
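The core idea in the abstract, adding the behavior distribution's score to the policy gradient, can be illustrated with a toy one-dimensional sketch. This is not the authors' implementation. It assumes a Gaussian behavior distribution whose score is known in closed form (standing in for a pretrained diffusion behavior model), a quadratic critic, and a hypothetical regularization coefficient `beta`. All function and parameter names are made up for illustration.

```python
# Toy sketch: deterministic policy trained with a score-regularized gradient.
# Assumptions (not from the paper): 1-D action, no state, Gaussian behavior
# distribution with an analytic score, quadratic critic, hypothetical `beta`.

def behavior_score(a, mean=0.0, std=1.0):
    """Score d/da log mu(a) of a Gaussian behavior distribution N(mean, std^2)."""
    return -(a - mean) / std**2

def q_grad(a, a_star=2.0):
    """Gradient d/da of a toy critic Q(a) = -(a - a_star)^2."""
    return -2.0 * (a - a_star)

def train_policy(beta=0.5, lr=0.1, steps=200, theta=0.0):
    """Gradient ascent on Q(a) + (1/beta) * log mu(a) with a = theta.

    The policy is a single deterministic action `theta`, so the policy
    gradient is simply the critic gradient plus the weighted behavior score.
    """
    for _ in range(steps):
        grad = q_grad(theta) + (1.0 / beta) * behavior_score(theta)
        theta += lr * grad
    return theta

if __name__ == "__main__":
    # Smaller beta pulls the action toward the behavior mean (0.0);
    # larger beta lets it approach the critic's optimum (2.0).
    print(train_policy(beta=0.5))
    print(train_policy(beta=1e6))
```

In this toy setting the learned action settles between the critic's optimum and the behavior mean, with `beta` controlling the trade-off. No iterative sampling from the behavior model is needed, since only its score is evaluated at the policy's action.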

Huayu Chen, Cheng Lu, Zhengyi Wang, Hang Su, Jun Zhu • 2023

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Offline Reinforcement Learning | D4RL antmaze-umaze (diverse) | Normalized Score | 82.1 | 74 |
| Offline Reinforcement Learning | D4RL AntMaze | AntMaze Umaze Return | 97 | 65 |
| Offline Reinforcement Learning | OGBench | AntMaze Giant Navigate | 0.00e+0 | 56 |
| Offline Reinforcement Learning | D4RL Adroit pen (human) | Normalized Return | 69 | 53 |
| Offline Reinforcement Learning | D4RL Adroit pen (cloned) | Normalized Return | 61 | 53 |
| Offline Reinforcement Learning | D4RL antmaze-large (diverse) | Normalized Score | 53.6 | 47 |
| Offline Reinforcement Learning | D4RL MuJoCo halfcheetah-medium-expert | Normalized Score | 92.2 | 43 |
| Offline Reinforcement Learning | D4RL MuJoCo walker2d-medium-expert | Normalized Score | 114 | 36 |
| Offline Reinforcement Learning | D4RL antmaze-large (play) | Normalized Score | 53.6 | 36 |
| Offline Reinforcement Learning | D4RL MuJoCo halfcheetah-medium-replay | Normalized Score | 0.514 | 36 |
Showing 10 of 55 rows
