Decision Mamba: Reinforcement Learning via Hybrid Selective Sequence Modeling

About

Recent work has shown the strength of transformer models in reinforcement learning (RL), where decision-making is formulated as sequence generation. Given task contexts such as multiple trajectories, transformer-based agents can self-improve in online environments, a capability called in-context RL. However, because attention in transformers has quadratic computational complexity, current in-context RL methods incur large computational costs as the task horizon increases. In contrast, the Mamba model is known for processing long-term dependencies efficiently, which gives in-context RL an opportunity to solve tasks that require long-term memory. To this end, we first implement Decision Mamba (DM) by replacing the backbone of Decision Transformer (DT) with Mamba. Then, we propose Decision Mamba-Hybrid (DM-H), which combines the merits of transformers in high-quality prediction and of Mamba in long-term memory. Specifically, DM-H first generates high-value sub-goals from long-term memory through the Mamba model. The sub-goals then prompt the transformer, which produces high-quality predictions. Experimental results demonstrate that DM-H achieves state-of-the-art performance on long- and short-term tasks across the D4RL, Grid World, and Tmaze benchmarks. Regarding efficiency, online testing of DM-H on long-term tasks is 28$\times$ faster than the transformer-based baselines.
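
The scaling argument in the abstract can be illustrated with a rough operation count. The sketch below is not taken from the paper: the function names are hypothetical, and it only counts token-pair interactions for full self-attention (quadratic in the horizon) against state updates for a linear-time recurrent scan, used here as a simplified stand-in for a Mamba-style model. It says nothing about measured wall-clock time, so it should not be read as reproducing the reported 28$\times$ speedup.

```python
# Illustrative only: hypothetical helper names, idealized operation counts.

def attention_pair_count(horizon: int) -> int:
    """Token-pair interactions for full self-attention over `horizon` tokens."""
    return horizon * horizon


def recurrent_step_count(horizon: int) -> int:
    """State updates for a linear-time recurrent scan over `horizon` tokens."""
    return horizon


def cost_ratio(horizon: int) -> float:
    """How many times more interactions attention needs than the scan."""
    return attention_pair_count(horizon) / recurrent_step_count(horizon)


if __name__ == "__main__":
    # The gap between the two counts widens linearly with the horizon.
    for horizon in (100, 1_000, 10_000):
        print(horizon, attention_pair_count(horizon),
              recurrent_step_count(horizon), cost_ratio(horizon))
```

Under these idealized counts the ratio equals the horizon itself, which is why long-horizon tasks are where a linear-time backbone would be expected to help most.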

Sili Huang, Jifeng Hu, Zhejian Yang, Liwei Yang, Tao Luo, Hechang Chen, Lichao Sun, Bo Yang • 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Darkroom | Grid World | Offline Training Time (hour) | 0.18 | 6 |
| Dark Key-to-Door | Grid World | Offline Training Time (hour) | 0.41 | 3 |
| Darkroom Hard | Grid World | Offline Training Time (hour) | 0.2 | 3 |
| HalfCheetah | D4RL | Training Time (hour) | 20.96 | 3 |
| Hopper | D4RL | Offline Training Time (hour) | 11.52 | 3 |
| Large Dark Key-to-Door | Large Grid World | Offline Training Time (hour) | 3.16 | 3 |
| Large Darkroom | Large Grid World | Offline Training Time (hour) | 2.38 | 3 |
| Large Darkroom Dynamic | Large Grid World | Offline Training Time (hour) | 2.63 | 3 |
| Large Darkroom Hard | Large Grid World | Offline Training Time (hour) | 2.78 | 3 |
| Walker2d | D4RL | Offline Training Time | 19.96 | 3 |