Decision Mamba: Reinforcement Learning via Hybrid Selective Sequence Modeling
About
Recent work has shown that transformer models perform remarkably well in reinforcement learning (RL) when decision-making is formulated as sequence generation. When given task context, such as multiple trajectories, transformer-based agents can self-improve in online environments, a capability known as in-context RL. However, because attention in transformers has quadratic computational complexity, current in-context RL methods incur huge computational costs as the task horizon increases. In contrast, the Mamba model is known for processing long-term dependencies efficiently, which gives in-context RL an opportunity to solve tasks that require long-term memory. To this end, we first implement Decision Mamba (DM) by replacing the backbone of Decision Transformer (DT). Then, we propose Decision Mamba-Hybrid (DM-H), which combines the strengths of transformers in high-quality prediction and of Mamba in long-term memory. Specifically, DM-H first generates high-value sub-goals from long-term memory through the Mamba model. The sub-goals are then used to prompt the transformer, which produces high-quality predictions. Experimental results demonstrate that DM-H achieves state-of-the-art performance on both long-term and short-term tasks, on benchmarks such as D4RL, Grid World, and Tmaze. Regarding efficiency, online testing of DM-H on the long-term task is 28$\times$ faster than the transformer-based baselines.
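The two-stage flow described in the abstract (a long-memory model proposes a sub-goal, which then prompts a short-context predictor) can be illustrated with a toy sketch. This is not the authors' implementation: `RecurrentMemory`, `short_context_policy`, `rollout`, `window`, and `subgoal_every` are hypothetical names, an exponential moving average stands in for Mamba, and a one-line rule stands in for the transformer. The only point it aims to show is that the memory keeps a constant-size state and is updated once per step, while the predictor sees only a short window plus the sub-goal.

```python
from collections import deque
from typing import Deque, List, Sequence


class RecurrentMemory:
    """Toy stand-in for the long-memory model (an assumption, NOT Mamba itself).

    Keeps a constant-size state and updates it once per step, so a
    context of length T costs T updates in total.
    """

    def __init__(self, decay: float = 0.5) -> None:
        self.decay = decay
        self.state = 0.0
        self.updates = 0

    def update(self, observation: float) -> float:
        # Exponential moving average: a simple linear recurrence.
        self.state = self.decay * self.state + (1.0 - self.decay) * observation
        self.updates += 1
        return self.state


def short_context_policy(recent: Sequence[float], subgoal: float) -> float:
    """Toy stand-in for the transformer (an assumption).

    It sees only a short window plus the sub-goal prompt and returns
    the offset from the latest observation to the sub-goal.
    """
    return subgoal - recent[-1]


def rollout(
    observations: Sequence[float],
    memory: RecurrentMemory,
    window: int = 4,          # hypothetical short-context length
    subgoal_every: int = 4,   # hypothetical sub-goal refresh interval
) -> List[float]:
    recent: Deque[float] = deque(maxlen=window)
    subgoal = 0.0
    actions: List[float] = []
    for t, obs in enumerate(observations):
        memory.update(obs)    # long-term memory sees every step
        recent.append(obs)    # predictor only keeps the last `window` steps
        if t % subgoal_every == 0:
            subgoal = memory.state  # refresh the prompt from long-term memory
        actions.append(short_context_policy(recent, subgoal))
    return actions


memory = RecurrentMemory(decay=0.5)
actions = rollout([1.0] * 8, memory)
print(actions[0], actions[4], memory.updates)  # -0.5 -0.03125 8
```

In this sketch the per-step cost of the memory does not grow with the history length, whereas full self-attention over the same history would compare every step with every other step.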
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Darkroom | Grid World | Offline Training Time (hour) | 0.18 | 6 |
| Dark Key-to-Door | Grid World | Offline Training Time (hour) | 0.41 | 3 |
| Darkroom Hard | Grid World | Offline Training Time (hour) | 0.2 | 3 |
| HalfCheetah | D4RL | Training Time (hour) | 20.96 | 3 |
| Hopper | D4RL | Offline Training Time (hour) | 11.52 | 3 |
| Large Dark Key-to-Door | Large Grid World | Offline Training Time (hour) | 3.16 | 3 |
| Large Darkroom | Large Grid World | Offline Training Time (hour) | 2.38 | 3 |
| Large Darkroom Dynamic | Large Grid World | Offline Training Time (hour) | 2.63 | 3 |
| Large Darkroom Hard | Large Grid World | Offline Training Time (hour) | 2.78 | 3 |
| Walker2d | D4RL | Offline Training Time | 19.96 | 3 |