
Q-learning Decision Transformer: Leveraging Dynamic Programming for Conditional Sequence Modelling in Offline RL

About

Recent work has shown that tackling offline reinforcement learning (RL) with a conditional policy produces promising results. The Decision Transformer (DT) combines the conditional policy approach with a transformer architecture and performs competitively on several benchmarks. However, DT lacks stitching ability, one of the critical abilities that lets offline RL learn the optimal policy from sub-optimal trajectories. This issue becomes particularly significant when the offline dataset contains only sub-optimal trajectories. In contrast, conventional RL approaches based on Dynamic Programming (such as Q-learning) do not have this limitation; however, they suffer from unstable learning behaviours, especially when they rely on function approximation in an off-policy learning setting. In this paper, we propose the Q-learning Decision Transformer (QDT) to address the shortcomings of DT by leveraging the benefits of Dynamic Programming (Q-learning). QDT uses the Dynamic Programming results to relabel the return-to-go in the training data and then trains the DT on the relabelled data. Our approach efficiently exploits the benefits of both approaches, with each compensating for the other's shortcomings, to achieve better performance. We demonstrate this empirically in both simple toy environments and the more complex D4RL benchmark, showing competitive performance gains.
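The relabelling step described in the abstract can be illustrated with a minimal sketch. It assumes an undiscounted return-to-go, a single trajectory, and a per-state value estimate already obtained from some Q-learning procedure; the function name `relabel_returns_to_go` and the backward "take the larger of the propagated return and the value estimate" rule are illustrative assumptions, not the authors' implementation.

```python
from typing import List, Sequence


def relabel_returns_to_go(
    rewards: Sequence[float],
    state_values: Sequence[float],
) -> List[float]:
    """Relabel return-to-go for one trajectory (hypothetical helper).

    Assumptions (not taken from the source):
      * returns are undiscounted sums of future rewards;
      * state_values[t] is a value estimate for the state at step t,
        e.g. max_a Q(s_t, a) from a separately trained Q-function;
      * the return-to-go after the final step is 0.
    """
    if len(rewards) != len(state_values):
        raise ValueError("rewards and state_values must have the same length")

    relabelled = [0.0] * len(rewards)
    next_rtg = 0.0  # return-to-go after the last step
    # Walk backwards so each step can reuse the already relabelled successor.
    for t in range(len(rewards) - 1, -1, -1):
        propagated = rewards[t] + next_rtg
        # Keep whichever is larger: the return propagated along the
        # trajectory or the value estimate from dynamic programming.
        relabelled[t] = max(propagated, state_values[t])
        next_rtg = relabelled[t]
    return relabelled


if __name__ == "__main__":
    # Toy trajectory: the value estimate at step 1 exceeds the observed return.
    print(relabel_returns_to_go([1.0, 0.0, 1.0], [0.0, 5.0, 0.0]))
```

In this toy case the value estimate at the middle step is higher than the return actually observed from that step onward, so the relabelled return-to-go there, and at every earlier step, reflects what a better continuation could achieve. That is the intuition behind using relabelled data to give a conditional sequence model some stitching ability.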

Taku Yamagata, Ahmed Khalil, Raul Santos-Rodriguez • 2022

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Offline Reinforcement Learning | D4RL halfcheetah-medium-expert | Normalized Score | 79 | 169 |
| Offline Reinforcement Learning | D4RL hopper-medium-expert | Normalized Score | 94.2 | 161 |
| Offline Reinforcement Learning | D4RL walker2d-medium-expert | Normalized Score | 101.7 | 132 |
| Offline Reinforcement Learning | D4RL Medium-Replay Hopper | Normalized Score | 52.1 | 109 |
| Offline Reinforcement Learning | D4RL Medium HalfCheetah | Normalized Score | 42.3 | 105 |
| Offline Reinforcement Learning | D4RL Medium Walker2d | Normalized Score | 67.1 | 104 |
| Offline Reinforcement Learning | D4RL Medium-Replay HalfCheetah | Normalized Score | 35.6 | 97 |
| Locomotion | D4RL walker2d-medium-expert | Normalized Score | 108.8 | 90 |
| Offline Reinforcement Learning | D4RL Walker2d Medium v2 | Normalized Return | 63.7 | 85 |
| walker2d locomotion | D4RL walker2d medium-replay | Normalized Score | 77.2 | 78 |
Showing 10 of 68 rows
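
For readers unfamiliar with the metric, D4RL normalized scores are conventionally computed by rescaling a raw episode return so that a random policy scores 0 and an expert policy scores 100. The sketch below shows that arithmetic; the function name and the reference returns in the usage example are hypothetical placeholders, not values from this page or from D4RL, and the "Normalized Return" row may follow a different convention.

```python
def d4rl_normalized_score(
    raw_return: float,
    random_return: float,
    expert_return: float,
) -> float:
    """Rescale a raw return so that random -> 0 and expert -> 100.

    random_return and expert_return are per-environment reference
    returns; the caller must supply them (none are hard-coded here).
    """
    if expert_return == random_return:
        raise ValueError("expert_return and random_return must differ")
    return 100.0 * (raw_return - random_return) / (expert_return - random_return)


if __name__ == "__main__":
    # Hypothetical reference returns, chosen only to make the arithmetic visible.
    print(d4rl_normalized_score(raw_return=50.0, random_return=0.0, expert_return=100.0))
```

Scores above 100 are possible under this convention: they indicate a policy whose raw return exceeds the expert reference.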
