DSSP: Diffusion State Space Policy with Full-History Encoding
About
Diffusion-based imitation learning has shown strong promise for robot manipulation. However, most existing policies condition only on the current observation or a short window of recent observations, which limits their ability to resolve history-dependent ambiguities in long-horizon tasks. To address this, we introduce DSSP, a history-conditioned Diffusion State Space Policy that enables efficient, full-history conditioning for robot manipulation. Leveraging the continuous sequence modeling properties of State Space Models (SSMs), our history encoder compresses the entire observation stream into a compact context representation. To ensure that this context preserves information relevant to future state evolution, the encoder is optimized with a dynamics-aware auxiliary training objective. This high-level context representation is then fused with recent state observations to form a hierarchical conditioning mechanism for action generation. Furthermore, to maintain architectural consistency and minimize GPU memory overhead, we also instantiate the diffusion backbone itself as an SSM. Extensive experiments across simulation benchmarks and real-world manipulation tasks show that DSSP achieves state-of-the-art performance with a significantly smaller model size, demonstrating that hierarchical conditioning captures crucial information more efficiently as the history length increases.
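The hierarchical conditioning described above can be illustrated with a small sketch. This is not the authors' implementation: it assumes a diagonal linear SSM recurrence for the history encoder, a linear next-observation predictor as a stand-in for the dynamics-aware auxiliary objective, and plain concatenation as the fusion step. All names (`ssm_encode`, `build_condition`, `dynamics_loss`, `recent_k`) and all dimensions are hypothetical.

```python
import random

def ssm_encode(observations, decay, b_in):
    """Run a diagonal linear SSM over the whole observation stream.

    Assumed recurrence (hypothetical): h_t = decay * h_{t-1} + B x_t.
    Returns the final state, whose size does not depend on history length.
    """
    state = [0.0] * len(decay)
    for obs in observations:
        # Project the observation into state space: drive = B x_t
        drive = [sum(w * x for w, x in zip(row, obs)) for row in b_in]
        # Elementwise decay of the previous state plus the new input
        state = [a * h + d for a, h, d in zip(decay, state, drive)]
    return state

def build_condition(observations, decay, b_in, recent_k=2):
    """Fuse the full-history context with the last recent_k raw observations."""
    context = ssm_encode(observations, decay, b_in)
    recent = [x for obs in observations[-recent_k:] for x in obs]
    return context + recent  # concatenation is an assumed fusion choice

def dynamics_loss(context, w_pred, next_obs):
    """Mean squared error of a linear next-observation prediction.

    A stand-in for a dynamics-aware auxiliary objective, not the paper's loss.
    """
    pred = [sum(w * c for w, c in zip(row, context)) for row in w_pred]
    return sum((p - y) ** 2 for p, y in zip(pred, next_obs)) / len(next_obs)

# Toy dimensions, chosen only for illustration
rng = random.Random(0)
obs_dim, state_dim = 3, 4
decay = [0.9] * state_dim
b_in = [[rng.uniform(-1, 1) for _ in range(obs_dim)] for _ in range(state_dim)]
w_pred = [[rng.uniform(-1, 1) for _ in range(state_dim)] for _ in range(obs_dim)]

short_history = [[rng.uniform(-1, 1) for _ in range(obs_dim)] for _ in range(50)]
long_history = [[rng.uniform(-1, 1) for _ in range(obs_dim)] for _ in range(500)]

# Both conditions have length state_dim + recent_k * obs_dim = 10
cond_short = build_condition(short_history, decay, b_in)
cond_long = build_condition(long_history, decay, b_in)

# Auxiliary loss: predict the last observation from the context of all earlier ones
aux = dynamics_loss(ssm_encode(long_history[:-1], decay, b_in), w_pred, long_history[-1])
```

In this sketch the conditioning vector has the same size for a 50-step and a 500-step history, which is the property that makes full-history conditioning cheap at action-generation time. In a real policy, the resulting vector would condition the diffusion denoiser.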
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Robotic Manipulation | RoboTwin 2.0 (test) | Average Success Rate | 62.3 | 30 |
| Robotic Manipulation | Adroit and MetaWorld | Average Success Rate | 80.1 | 28 |
| Robotic Manipulation Success | MetaWorld | Success Rate (Easy) | 90.5 | 7 |
| Robotic Manipulation Success | Adroit | Success Rate | 73 | 7 |
| Robot Manipulation | Real-world Put Bottles | Success Rate | 0.6 | 4 |
| Robot Manipulation | Real-world Object Swap | Success Rate | 65 | 4 |
| Robot Manipulation | Real-world Morse Tapping | Success Rate | 85 | 4 |