
You Can't Count on Luck: Why Decision Transformers and RvS Fail in Stochastic Environments

About

Recently, methods such as Decision Transformer that reduce reinforcement learning to a prediction task and solve it via supervised learning (RvS) have become popular due to their simplicity, robustness to hyperparameters, and strong overall performance on offline RL tasks. However, simply conditioning a probabilistic model on a desired return and taking the predicted action can fail dramatically in stochastic environments, since trajectories that achieve a high return may have done so only by luck. In this work, we describe the limitations of RvS approaches in stochastic environments and propose a solution. Rather than conditioning on the return of a single trajectory, as is standard practice, our proposed method, ESPER, learns to cluster trajectories and conditions on average cluster returns, which are independent of environment stochasticity. Doing so allows ESPER to achieve strong alignment between the target return and expected performance in real environments. We demonstrate this in several challenging stochastic offline-RL tasks, including the puzzle game 2048 and Connect Four against a stochastic opponent. In all tested domains, ESPER achieves significantly better alignment between the target return and the achieved return than conditioning on raw returns, and also achieves higher maximum performance than even value-based baselines.
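The core relabeling idea from the abstract can be sketched in a few lines. This is a simplified illustration, not the paper's actual method: ESPER learns clusters with an adversarial objective, whereas the sketch below uses plain k-means over hand-picked trajectory features. The function names and feature representation are assumptions for illustration only.

```python
import numpy as np

def esper_style_relabel(trajectory_features, returns, n_clusters=2,
                        n_iters=20, seed=0):
    """Simplified ESPER-style relabeling sketch (k-means stands in for
    the paper's learned clustering). Each trajectory's conditioning
    target is replaced by its cluster's average return, so that luck
    from environment stochasticity is averaged out within a cluster."""
    rng = np.random.default_rng(seed)
    X = np.asarray(trajectory_features, dtype=float)
    returns = np.asarray(returns, dtype=float)

    # Initialize centroids from randomly chosen trajectories.
    centroids = X[rng.choice(len(X), size=n_clusters, replace=False)].copy()
    for _ in range(n_iters):
        # Assign each trajectory to its nearest centroid.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned trajectories.
        for k in range(n_clusters):
            if (labels == k).any():
                centroids[k] = X[labels == k].mean(axis=0)

    # Condition on the average cluster return instead of the raw
    # (possibly lucky) per-trajectory return.
    relabeled = np.array([returns[labels == k].mean() for k in labels])
    return labels, relabeled
```

A return-conditioned policy (e.g. a Decision Transformer) would then be trained with `relabeled` as the conditioning target in place of the raw trajectory returns.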

Keiran Paster, Sheila McIlraith, Jimmy Ba • 2022

Related benchmarks

Task                            Dataset                    Result                  Rank
Offline Reinforcement Learning  Kitchen Partial            Normalized Score 49.8   62
Offline Reinforcement Learning  hopper medium              Normalized Score 58     52
Offline Reinforcement Learning  walker2d medium            Normalized Score 79.2   51
Offline Reinforcement Learning  walker2d medium-replay     Normalized Score 26.7   50
Offline Reinforcement Learning  hopper medium-replay       Normalized Score 48.6   44
Offline Reinforcement Learning  halfcheetah medium         Normalized Score 44.4   43
Offline Reinforcement Learning  halfcheetah medium-replay  Normalized Score 46.2   43
Offline Reinforcement Learning  kitchen mixed              Normalized Score 51     29
Offline Reinforcement Learning  Antmaze umaze              Average Return 74       24
Offline Reinforcement Learning  Antmaze umaze-diverse      Average Return 84       15

Showing 10 of 26 rows

Other info

Code
