Decision Mamba: Reinforcement Learning via Hybrid Selective Sequence Modeling

About

Recent work has shown that transformer models perform remarkably well in reinforcement learning (RL) when decision-making is formulated as sequence generation. Given task context such as multiple trajectories, transformer-based agents can improve themselves in online environments, a capability known as in-context RL. However, because attention in transformers has quadratic computational complexity, current in-context RL methods incur heavy computational costs as the task horizon increases. In contrast, the Mamba model is known for processing long-term dependencies efficiently, which gives in-context RL an opportunity to solve tasks that require long-term memory. To this end, we first implement Decision Mamba (DM) by replacing the backbone of Decision Transformer (DT). We then propose Decision Mamba-Hybrid (DM-H), which combines the strengths of transformers in high-quality prediction and of Mamba in long-term memory. Specifically, DM-H first generates high-value sub-goals from long-term memory through the Mamba model, then uses these sub-goals to prompt the transformer, yielding high-quality predictions. Experimental results show that DM-H achieves state-of-the-art performance on both long- and short-term tasks across the D4RL, Grid World, and Tmaze benchmarks. In terms of efficiency, online testing of DM-H on the long-term task is 28$\times$ faster than the transformer-based baselines.
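The two-stage idea in the abstract (a long-memory model proposes sub-goals, and a short-context model acts on them) can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not the authors' implementation: a scalar exponential-decay recurrence stands in for Mamba, a windowed average stands in for the transformer, and the names `hybrid_rollout`, `refresh_every`, `window`, and `decay` are hypothetical. The two cost helpers only count operations to contrast quadratic attention with a linear scan; they do not reproduce the reported 28$\times$ speedup.

```python
from typing import List, Sequence


def pairwise_attention_cost(n: int) -> int:
    """Query-key score computations for full self-attention over n tokens."""
    return n * n


def scan_cost(n: int) -> int:
    """State updates for a single linear recurrent pass over n tokens."""
    return n


def hybrid_rollout(
    observations: Sequence[float],
    refresh_every: int = 4,  # hypothetical sub-goal refresh interval
    window: int = 4,         # hypothetical short context length
    decay: float = 0.9,      # hypothetical memory decay rate
) -> List[float]:
    """Toy two-stage loop: long-memory sub-goal, then short-context action.

    This is NOT Mamba or a transformer; both are replaced by trivial
    stand-ins so that the control flow is easy to follow.
    """
    memory = 0.0    # running summary of the whole history (O(1) per step)
    sub_goal = 0.0  # last sub-goal emitted by the long-memory stand-in
    actions: List[float] = []
    for t, obs in enumerate(observations):
        # Long-memory stand-in: one recurrent update per step.
        memory = decay * memory + (1.0 - decay) * obs
        # Emit a fresh sub-goal only every `refresh_every` steps.
        if t % refresh_every == 0:
            sub_goal = memory
        # Short-context stand-in: sees only the last `window` observations
        # and is "prompted" with the current sub-goal.
        recent = observations[max(0, t + 1 - window): t + 1]
        local = sum(recent) / len(recent)
        actions.append(sub_goal - local)
    return actions


if __name__ == "__main__":
    for n in (100, 1_000, 10_000):
        print(n, pairwise_attention_cost(n), scan_cost(n))
    print(hybrid_rollout([0.0, 1.0, 2.0, 3.0, 4.0, 5.0]))
```

In this sketch the per-step work of the long-memory stand-in does not grow with the history length, while the short-context stand-in only ever touches `window` items, which is the division of labour the abstract describes.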

Sili Huang, Jifeng Hu, Zhejian Yang, Liwei Yang, Tao Luo, Hechang Chen, Lichao Sun, Bo Yang • 2024

Related benchmarks

Task	Dataset	Result
Darkroom	Grid World	Offline Training Time (hour): 0.18	6
Dark Key-to-Door	Grid World	Offline Training Time (hour): 0.41	3
Darkroom Hard	Grid World	Offline Training Time (hour): 0.2	3
HalfCheetah	D4RL	Training Time (hour): 20.96	3
Hopper	D4RL	Offline Training Time (hour): 11.52	3
Large Dark Key-to-Door	Large Grid World	Offline Training Time (hour): 3.16	3
Large Darkroom	Large Grid World	Offline Training Time (hour): 2.38	3
Large Darkroom Dynamic	Large Grid World	Offline Training Time (hour): 2.63	3
Large Darkroom Hard	Large Grid World	Offline Training Time (hour): 2.78	3
Walker2d	D4RL	Offline Training Time: 19.96	3
