Q-value Regularized Transformer for Offline Reinforcement Learning

About

Recent advancements in offline reinforcement learning (RL) have underscored the capabilities of Conditional Sequence Modeling (CSM), a paradigm that learns the action distribution conditioned on the trajectory history and the target return for each state. However, these methods often struggle to stitch together optimal trajectories from sub-optimal ones, because the returns sampled within individual trajectories are inconsistent with the optimal returns across multiple trajectories. Dynamic Programming (DP) methods offer a solution by leveraging a value function to approximate optimal future returns for each state, but these techniques are prone to unstable learning behaviors, particularly in long-horizon and sparse-reward scenarios. Building upon these insights, we propose the Q-value regularized Transformer (QT), which combines the trajectory modeling ability of the Transformer with the predictability of optimal future returns from DP methods. QT learns an action-value function and integrates a term maximizing action-values into the training loss of CSM, so as to seek optimal actions that align closely with the behavior policy. Empirical evaluations on D4RL benchmark datasets demonstrate the superiority of QT over traditional DP and CSM methods, highlighting the potential of QT to enhance the state-of-the-art in offline RL.
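The abstract describes the training objective only in words: a CSM loss plus a term that maximizes action-values. A minimal sketch of that idea follows, assuming a mean-squared-error CSM term, scalar actions, and a weighting coefficient `eta`; the names `csm_loss`, `q_regularized_loss`, `toy_q`, and `eta` are hypothetical and not taken from the paper, and the toy critic stands in for a learned action-value function.

```python
from typing import Callable, Sequence


def csm_loss(pred_actions: Sequence[float], data_actions: Sequence[float]) -> float:
    """Mean squared error between predicted and dataset actions.

    Assumption: the CSM term is a plain regression loss toward the
    behavior policy's actions.
    """
    n = len(pred_actions)
    return sum((p - d) ** 2 for p, d in zip(pred_actions, data_actions)) / n


def q_regularized_loss(
    pred_actions: Sequence[float],
    data_actions: Sequence[float],
    states: Sequence[float],
    q_fn: Callable[[float, float], float],
    eta: float = 1.0,
) -> float:
    """CSM loss minus eta times the mean Q-value of the predicted actions.

    Minimizing this keeps actions close to the dataset (first term)
    while pushing them toward higher estimated value (second term).
    """
    q_mean = sum(q_fn(s, a) for s, a in zip(states, pred_actions)) / len(pred_actions)
    return csm_loss(pred_actions, data_actions) - eta * q_mean


def toy_q(state: float, action: float) -> float:
    """Hypothetical critic: value peaks when the action equals the state."""
    return -((action - state) ** 2)


states = [1.0, 2.0]
data_actions = [0.0, 0.0]
pred_actions = [0.5, 0.5]
loss = q_regularized_loss(pred_actions, data_actions, states, toy_q, eta=1.0)
```

Setting `eta=0.0` recovers the plain CSM loss, so the coefficient controls how far the policy is allowed to drift from the behavior policy in search of higher-value actions.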

Shengchao Hu, Ziqing Fan, Chaoqin Huang, Li Shen, Ya Zhang, Yanfeng Wang, Dacheng Tao • 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Offline Reinforcement Learning | D4RL Gym walker2d (medium-replay) | Normalized Return | 94.2 | 68 |
| Offline Reinforcement Learning | D4RL Gym halfcheetah-medium | Normalized Return | 49.1 | 60 |
| Offline Reinforcement Learning | D4RL Gym walker2d medium | Normalized Return | 87.6 | 58 |
| Offline Reinforcement Learning | D4RL Gym hopper (medium-replay) | Normalized Return | 102.1 | 44 |
| Offline Reinforcement Learning | D4RL Gym halfcheetah-medium-replay | Normalized Average Return | 48.9 | 43 |
| Offline Reinforcement Learning | D4RL Gym hopper-medium | Normalized Return | 78 | 41 |
| Offline Reinforcement Learning | D4RL Kitchen-Partial | Normalized Performance | 73.2 | 19 |
| Offline Reinforcement Learning | D4RL Adroit hammer-human v1 | Normalized Score | 2.48e+3 | 9 |
| Offline Reinforcement Learning | D4RL Kitchen (kitchen-complete) | Normalized Score | 75 | 9 |
| Offline Reinforcement Learning | D4RL Adroit pen-cloned v1 | Normalized Score | 90.1 | 9 |
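The benchmark results are reported as normalized values. For context, D4RL normalizes a raw episode return against per-environment random and expert reference returns, so that 0 corresponds to a random policy and 100 to the expert reference. The sketch below shows that arithmetic; the function name and the reference values in the example are hypothetical placeholders, not the actual D4RL reference scores for any dataset listed here.

```python
def normalized_score(raw_return: float, random_return: float, expert_return: float) -> float:
    """Scale a raw return so that random -> 0 and expert -> 100."""
    return 100.0 * (raw_return - random_return) / (expert_return - random_return)


# Hypothetical reference returns, chosen only to illustrate the formula.
example = normalized_score(raw_return=1500.0, random_return=500.0, expert_return=2500.0)
```

Under this definition a score above 100 means the evaluated policy collected more return than the expert reference.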
