
Offline Reinforcement Learning as One Big Sequence Modeling Problem

About

Reinforcement learning (RL) is typically concerned with estimating stationary policies or single-step models, leveraging the Markov property to factorize problems in time. However, we can also view RL as a generic sequence modeling problem, with the goal being to produce a sequence of actions that leads to a sequence of high rewards. Viewed in this way, it is tempting to consider whether high-capacity sequence prediction models that work well in other domains, such as natural-language processing, can also provide effective solutions to the RL problem. To this end, we explore how RL can be tackled with the tools of sequence modeling, using a Transformer architecture to model distributions over trajectories and repurposing beam search as a planning algorithm. Framing RL as a sequence modeling problem simplifies a range of design decisions, allowing us to dispense with many of the components common in offline RL algorithms. We demonstrate the flexibility of this approach across long-horizon dynamics prediction, imitation learning, goal-conditioned RL, and offline RL. Further, we show that this approach can be combined with existing model-free algorithms to yield a state-of-the-art planner in sparse-reward, long-horizon tasks.
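The core idea of repurposing beam search as a planner can be illustrated with a minimal sketch. The `log_prob_fn` below is a hypothetical stand-in for the trained Transformer's next-token log-probabilities; the actual method additionally discretizes states, actions, and rewards into tokens and scores candidates with reward estimates rather than likelihood alone.

```python
def beam_search_plan(log_prob_fn, vocab, prefix, horizon, beam_width):
    """Plan by beam search over discrete trajectory tokens.

    log_prob_fn(seq, tok) -> log-probability of appending token `tok`
    to the partial trajectory `seq` (here, a stand-in for a trained
    sequence model). Returns the highest-scoring completed sequence.
    """
    beams = [(0.0, list(prefix))]  # (cumulative log-prob, token sequence)
    for _ in range(horizon):
        candidates = []
        for score, seq in beams:
            # Expand every beam with every possible next token.
            for tok in vocab:
                candidates.append((score + log_prob_fn(seq, tok), seq + [tok]))
        # Keep only the top-k expansions by cumulative score.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
    return beams[0][1]


# Toy model that always prefers token 1 with probability 0.9.
import math

def toy_log_prob(seq, tok):
    return math.log(0.9) if tok == 1 else math.log(0.1)

plan = beam_search_plan(toy_log_prob, vocab=[0, 1],
                        prefix=[], horizon=3, beam_width=2)
# plan == [1, 1, 1]
```

In the offline RL setting described above, swapping likelihood for a reward-augmented score turns this generic decoder into a trajectory optimizer.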

Michael Janner, Qiyang Li, Sergey Levine • 2021

Related benchmarks

Task                           | Dataset                         | Result                  | Rank
Offline Reinforcement Learning | D4RL halfcheetah-medium-expert  | Normalized Score 95     | 117
Offline Reinforcement Learning | D4RL hopper-medium-expert       | Normalized Score 110    | 115
Offline Reinforcement Learning | D4RL walker2d-medium-expert     | Normalized Score 101.9  | 86
Offline Reinforcement Learning | D4RL walker2d-random            | Normalized Score 5.6    | 77
Offline Reinforcement Learning | D4RL Medium-Replay Hopper       | Normalized Score 91.5   | 72
Offline Reinforcement Learning | D4RL halfcheetah-random         | Normalized Score 7.9    | 70
Offline Reinforcement Learning | D4RL Walker2d Medium v2         | Normalized Return 82.6  | 67
Offline Reinforcement Learning | D4RL hopper-random              | Normalized Score 6.7    | 62
Offline Reinforcement Learning | D4RL Medium HalfCheetah         | Normalized Score 46.9   | 59
Offline Reinforcement Learning | D4RL Medium-Replay HalfCheetah  | Normalized Score 41.9   | 59
Showing 10 of 90 rows
...

Other info

Code
