
Decision S4: Efficient Sequence-Based RL via State Spaces Layers

About

Recently, sequence learning methods have been applied to the problem of off-policy Reinforcement Learning, including the seminal work on Decision Transformers, which employs transformers for this task. Since transformers are parameter-heavy, cannot benefit from history longer than a fixed window size, and are not computed using recurrence, we set out to investigate the suitability of the S4 family of models, which are based on state-space layers and have been shown to outperform transformers, especially in modeling long-range dependencies. In this work we present two main algorithms: (i) an off-policy training procedure that works with trajectories, while still maintaining the training efficiency of the S4 model; and (ii) an on-policy training procedure that is trained in a recurrent manner, benefits from long-range dependencies, and is based on a novel stable actor-critic mechanism. Our results indicate that our method outperforms multiple variants of Decision Transformers, as well as the other baseline methods, on most tasks, while reducing the latency, number of parameters, and training time by several orders of magnitude, making our approach more suitable for real-world RL.
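The recurrence property the abstract contrasts with transformers can be made concrete with a minimal sketch (not the authors' code) of the discrete state-space recurrence that S4-style layers compute: x_k = A·x_{k-1} + B·u_k, y_k = C·x_k + D·u_k. Run step by step, each update costs a fixed amount of work regardless of how long the history is, which is why such a layer can condition on an unbounded past at constant per-step latency. The matrices below are arbitrary toy values, not S4's structured parameterization.

```python
import numpy as np

def ssm_scan(A, B, C, D, u):
    """Recurrent evaluation of a single-input, single-output state-space model.

    A: (N, N) state matrix, B: (N,) input map, C: (N,) readout, D: scalar
    feedthrough, u: (L,) input sequence. Returns y: (L,) output sequence.
    """
    x = np.zeros(A.shape[0])
    y = np.empty(len(u), dtype=float)
    for k, u_k in enumerate(u):
        x = A @ x + B * u_k      # state update: carries the entire history
        y[k] = C @ x + D * u_k   # readout at step k
    return y

# Toy example: a stable 2-state system (eigenvalues 0.9 and 0.8) driven by
# a unit step input; the output settles toward its steady-state value.
A = np.array([[0.9, 0.0],
              [0.1, 0.8]])
B = np.array([1.0, 0.0])
C = np.array([0.0, 1.0])
u = np.ones(100)
y = ssm_scan(A, B, C, 0.0, u)
```

At training time, S4 avoids this sequential loop by evaluating the same linear system as a convolution over the whole sequence; the recurrent form shown here is what makes per-step inference cheap.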

Shmuel Bar-David, Itamar Zimerman, Eliya Nachmani, Lior Wolf • 2023

Related benchmarks

Task | Dataset | Metric | Result | Rank
Offline Reinforcement Learning | D4RL halfcheetah-medium-expert | Normalized Score | 92.7 | 117
Offline Reinforcement Learning | D4RL hopper-medium-expert | Normalized Score | 110.8 | 115
Offline Reinforcement Learning | D4RL walker2d-medium-expert | Normalized Score | 105.7 | 86
Offline Reinforcement Learning | D4RL Medium-Replay Hopper | Normalized Score | 49.6 | 72
Offline Reinforcement Learning | D4RL Medium HalfCheetah | Normalized Score | 42.5 | 59
Offline Reinforcement Learning | D4RL Medium-Replay HalfCheetah | Normalized Score | 15.2 | 59
Offline Reinforcement Learning | D4RL Medium Walker2d | Normalized Score | 78 | 58
Offline Reinforcement Learning | D4RL walker2d medium-replay | Normalized Score | 69 | 45
Offline Reinforcement Learning | D4RL Hopper Medium v2 | Normalized Score | 54.7 | 26
Offline multitask Reinforcement Learning | Franka Kitchen kitchen-mixed | Average Episodic Return | 47.7 | 23

Showing 10 of 15 rows
