
Critic-Guided Decision Transformer for Offline Reinforcement Learning

About

Recent advancements in offline reinforcement learning (RL) have underscored the capabilities of Return-Conditioned Supervised Learning (RCSL), a paradigm that learns the action distribution based on target returns for each state in a supervised manner. However, prevailing RCSL methods largely focus on deterministic trajectory modeling, disregarding stochastic state transitions and the diversity of future trajectory distributions. A fundamental challenge arises from the inconsistency between the sampled returns within individual trajectories and the expected returns across multiple trajectories. Fortunately, value-based methods offer a solution by leveraging a value function to approximate the expected returns, thereby addressing the inconsistency effectively. Building upon these insights, we propose a novel approach, termed the Critic-Guided Decision Transformer (CGDT), which combines the predictability of long-term returns from value-based methods with the trajectory modeling capability of the Decision Transformer. By incorporating a learned value function, known as the critic, CGDT ensures a direct alignment between the specified target returns and the expected returns of actions. This integration bridges the gap between the deterministic nature of RCSL and the probabilistic characteristics of value-based methods. Empirical evaluations on stochastic environments and D4RL benchmark datasets demonstrate the superiority of CGDT over traditional RCSL methods. These results highlight the potential of CGDT to advance the state of the art in offline RL and extend the applicability of RCSL to a wide range of RL tasks.
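To make the core idea concrete, here is a minimal toy sketch (not the paper's implementation) of critic guidance: a learned critic Q(s, a) approximates the expected return of an action, and among candidate actions proposed by a return-conditioned policy, the one whose critic-estimated return best matches the specified target return is selected. The `critic` function below is a hypothetical linear stand-in for what would be a learned network in CGDT.

```python
import numpy as np

def critic(state, action):
    # Stand-in linear critic Q(s, a); in CGDT this is a learned network
    # that approximates the expected return of taking `action` in `state`.
    return float(state @ action)

def critic_guided_choice(state, candidate_actions, target_return):
    # Pick the candidate whose expected return (per the critic) is
    # closest to the target return, aligning the target with the
    # expected return of the chosen action.
    gaps = [abs(critic(state, a) - target_return) for a in candidate_actions]
    return candidate_actions[int(np.argmin(gaps))]

state = np.array([1.0, 2.0])
candidates = [np.array([1.0, 0.0]),   # critic value 1.0
              np.array([0.0, 1.0]),   # critic value 2.0
              np.array([1.0, 1.0])]   # critic value 3.0
best = critic_guided_choice(state, candidates, target_return=2.8)
print(best)  # the action whose critic value is closest to 2.8
```

This illustrates why the critic resolves the inconsistency the abstract describes: the selection criterion is the critic's *expected* return rather than the return sampled along a single trajectory, so stochastic transitions no longer mislead the conditioning signal.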

Yuanfu Wang, Chao Yang, Ying Wen, Yu Liu, Yu Qiao • 2023

Related benchmarks

Task: Offline Reinforcement Learning (D4RL MuJoCo benchmarks)

Dataset                                        Metric                   Result   Rank
D4RL Gym halfcheetah-medium                    Normalized Return        43       44
D4RL MuJoCo Hopper medium (standard)           Normalized Score         96.9     36
D4RL MuJoCo Hopper-mr v2 (medium-replay)       Avg. Normalized Score    93.4     29
D4RL MuJoCo Walker2d-mr v2 (medium-replay)     Avg. Normalized Score    78.1     29
MuJoCo hopper D4RL (medium-replay)             Normalized Return        93.4     26
D4RL MuJoCo Hopper-Medium-Expert v2            Normalized Score         107.6    22
MuJoCo walker2d medium-replay D4RL             Normalized Return        78.1     20
MuJoCo walker2d-medium D4RL                    Normalized Return        79.1     20
MuJoCo halfcheetah-medium-replay D4RL          Normalized Return        40.4     20
MuJoCo halfcheetah-medium D4RL                 Normalized Return        43       20

(10 of 18 rows shown)
