Efficient Diffusion Policies for Offline Reinforcement Learning

About

Offline reinforcement learning (RL) aims to learn optimal policies from offline datasets, where the parameterization of policies is crucial but often overlooked. Recently, Diffusion-QL significantly boosted the performance of offline RL by representing a policy with a diffusion model, whose success relies on a parametrized Markov chain with hundreds of steps for sampling. However, Diffusion-QL suffers from two critical limitations. 1) It is computationally inefficient to forward and backward through the whole Markov chain during training. 2) It is incompatible with maximum likelihood-based RL algorithms (e.g., policy gradient methods) as the likelihood of diffusion models is intractable. Therefore, we propose efficient diffusion policy (EDP) to overcome these two challenges. EDP approximately constructs actions from corrupted ones at training to avoid running the sampling chain. We conduct extensive experiments on the D4RL benchmark. The results show that EDP can reduce the diffusion policy training time from 5 days to 5 hours on gym-locomotion tasks. Moreover, we show that EDP is compatible with various offline RL algorithms (TD3, CRR, and IQL) and achieves a new state of the art on D4RL by large margins over previous methods. Our code is available at https://github.com/sail-sg/edp.
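
The idea of constructing actions from corrupted ones can be illustrated with the standard closed-form relation between a clean sample and its noised version in a DDPM-style diffusion model. The following is a minimal sketch, not the authors' implementation: the linear beta schedule, the function names, and the use of plain Python lists are all assumptions made for illustration. In practice the noise estimate would come from a learned network conditioned on the state; here it is passed in directly.

```python
import math

def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=2e-2):
    """Cumulative products of (1 - beta_k) for a linear beta schedule.

    The linear schedule and its endpoints are assumptions for this sketch.
    """
    alpha_bars, prod = [], 1.0
    for k in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * k / max(num_steps - 1, 1)
        prod *= 1.0 - beta
        alpha_bars.append(prod)
    return alpha_bars

def corrupt_action(action, noise, alpha_bar):
    """Closed-form forward diffusion: a_k = sqrt(ab) * a_0 + sqrt(1 - ab) * eps."""
    return [math.sqrt(alpha_bar) * a + math.sqrt(1.0 - alpha_bar) * e
            for a, e in zip(action, noise)]

def approximate_action(noisy_action, predicted_noise, alpha_bar):
    """One-step estimate of the clean action from a corrupted one.

    Inverts the forward relation using a noise estimate:
    a_0_hat = (a_k - sqrt(1 - ab) * eps_hat) / sqrt(ab).
    No iteration over the sampling chain is needed.
    """
    return [(a_k - math.sqrt(1.0 - alpha_bar) * e) / math.sqrt(alpha_bar)
            for a_k, e in zip(noisy_action, predicted_noise)]

# Usage: if the noise estimate were perfect, the clean action is recovered exactly.
alpha_bars = make_alpha_bars(100)
clean = [0.3, -0.7]
noise = [0.5, -1.2]
noisy = corrupt_action(clean, noise, alpha_bars[50])
recovered = approximate_action(noisy, noise, alpha_bars[50])
```

With an imperfect noise estimate the recovered action is only approximate, which is consistent with the abstract's wording that actions are constructed "approximately" during training.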

Bingyi Kang, Xiao Ma, Chao Du, Tianyu Pang, Shuicheng Yan • 2023

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| hopper locomotion | D4RL hopper medium-replay | Normalized Score 83 | 56 |
| walker2d locomotion | D4RL walker2d medium-replay | Normalized Score 87 | 53 |
| Locomotion | D4RL walker2d-medium-expert | Normalized Score 110.4 | 47 |
| Locomotion | D4RL Walker2d medium | Normalized Score 86.5 | 44 |
| Locomotion | D4RL HalfCheetah Medium-Replay | Normalized Score 0.449 | 33 |
| Navigation | D4RL antmaze-medium-play | Normalized Score 73.3 | 22 |
| Navigation | D4RL antmaze-medium-diverse | Normalized Score 52.7 | 22 |
| Continuous Control | D4RL Hopper medium | Normalized Return 72.6 | 19 |
| Locomotion | D4RL hopper-medium-expert | Normalized Score (100k Steps) 110.8 | 18 |
| Navigation | D4RL antmaze-large-play (antmaze-l-p) | Normalized Score 33.3 | 17 |
Showing 10 of 15 rows
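
For readers unfamiliar with the "Normalized Score" metric, D4RL conventionally rescales a raw episode return so that 0 corresponds to a random policy and 100 to an expert policy. The sketch below shows that rescaling; the function name and the reference returns in the usage lines are made up for illustration and are not the actual per-environment constants.

```python
def normalized_score(raw_return, random_return, expert_return):
    """D4RL-style normalization: 0 = random policy, 100 = expert policy."""
    return 100.0 * (raw_return - random_return) / (expert_return - random_return)

# Usage with hypothetical reference returns (not real D4RL constants).
score = normalized_score(raw_return=1510.0, random_return=10.0, expert_return=3010.0)
```

Scores above 100, such as some medium-expert entries, indicate returns higher than the expert reference.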
