State-Action Inpainting Diffuser for Continuous Control with Delay
About
Signal delay poses a fundamental challenge in continuous control and reinforcement learning (RL) by introducing a temporal gap between interaction and perception. Current solutions have largely evolved along two distinct paradigms: model-free approaches, which use state augmentation to restore the Markov property, and model-based methods, which infer latent beliefs via dynamics modeling. In this paper, we bridge these perspectives by introducing the State-Action Inpainting Diffuser (SAID), a framework that integrates the inductive bias of dynamics learning with the direct decision-making capability of policy optimization. By formulating the problem as a joint sequence inpainting task, SAID implicitly captures environmental dynamics while directly generating consistent plans, effectively operating at the intersection of the model-based and model-free paradigms. Crucially, this generative formulation allows SAID to be applied seamlessly to both online and offline RL. Extensive experiments on delayed continuous control benchmarks demonstrate that SAID achieves state-of-the-art and robust performance. Our study suggests a new methodology for advancing RL with delay.
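To make the joint inpainting formulation concrete, here is a minimal sketch of how a diffusion model could fill in the unknown part of a state-action sequence while clamping the observed, delayed prefix. All names (`inpaint_state_action_plan`, `denoiser`, the linear noise schedule, the DDIM-style reverse loop) are illustrative assumptions, not the paper's actual architecture, schedule, or conditioning scheme.

```python
import torch

def inpaint_state_action_plan(denoiser, obs_states, obs_actions,
                              horizon, state_dim, act_dim, n_steps=50):
    """Hypothetical reverse-diffusion loop that generates the unknown part of a
    state-action sequence while clamping the observed (delayed) prefix.

    obs_states : (k, state_dim)   last states observed under a delay of k-1 steps
    obs_actions: (k-1, act_dim)   actions already issued during the delay window
    denoiser   : callable(x, t) -> predicted clean sequence, x: (horizon, state_dim + act_dim)
    """
    dim = state_dim + act_dim
    x = torch.randn(horizon, dim)                      # start the plan from pure noise

    # Template of known entries and a boolean mask marking which entries are observed.
    known = torch.zeros(horizon, dim)
    mask = torch.zeros(horizon, dim, dtype=torch.bool)
    k = obs_states.shape[0]
    known[:k, :state_dim] = obs_states
    mask[:k, :state_dim] = True
    known[:k - 1, state_dim:] = obs_actions
    mask[:k - 1, state_dim:] = True

    # Simple linear noise schedule (placeholder for whatever schedule the model was trained with).
    betas = torch.linspace(1e-4, 2e-2, n_steps)
    alpha_bars = torch.cumprod(1.0 - betas, dim=0)

    for t in reversed(range(n_steps)):
        # Conditioning by replacement: overwrite observed entries with their noised
        # ground-truth values so the model only inpaints the unknown states and actions.
        noised_known = alpha_bars[t].sqrt() * known + (1 - alpha_bars[t]).sqrt() * torch.randn_like(known)
        x = torch.where(mask, noised_known, x)

        x0_hat = denoiser(x, t)                        # predicted clean (state, action) sequence
        eps_hat = (x - alpha_bars[t].sqrt() * x0_hat) / (1 - alpha_bars[t]).sqrt()
        if t > 0:                                      # deterministic DDIM-style step toward x0_hat
            x = alpha_bars[t - 1].sqrt() * x0_hat + (1 - alpha_bars[t - 1]).sqrt() * eps_hat
        else:
            x = x0_hat

    x = torch.where(mask, known, x)                    # hard-clamp the observed prefix
    return x[:, :state_dim], x[:, state_dim:]          # generated states and actions

# Toy usage with a stand-in denoiser, just to show the shapes involved.
dummy = lambda x, t: torch.zeros_like(x)
states, actions = inpaint_state_action_plan(dummy, torch.randn(4, 17), torch.randn(3, 6),
                                            horizon=32, state_dim=17, act_dim=6)
```

The clamping step is the standard conditioning-by-replacement trick used by trajectory diffusers; how SAID actually conditions on the delayed observations and extracts the action to execute may differ from this sketch.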
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Offline Reinforcement Learning | halfcheetah medium v2 | Average Score | 58.5 | 27 |
| Offline Reinforcement Learning | halfcheetah medium-expert v2 | Normalized Score | 106.2 | 18 |
| Offline Reinforcement Learning | walker2d medium v2 | Normalized Score | 84.8 | 18 |
| Reinforcement Learning | MuJoCo HalfCheetah v5 | Mean Episodic Return | 1.48e+4 | 17 |
| Reinforcement Learning | MuJoCo Ant v5 | Mean Episodic Return | 5.95e+3 | 17 |
| Reinforcement Learning | MuJoCo Hopper v5 | Mean Episodic Return | 3.27e+3 | 17 |
| Reinforcement Learning | MuJoCo Walker2d v5 | Mean Episodic Return | 5.22e+3 | 17 |
| Reinforcement Learning | Task Average (HC, Ant, Hop, Walk) v5 | Mean Episodic Return | 7.28e+3 | 17 |
| Offline Reinforcement Learning | halfcheetah medium-replay v2 | Normalized Score | 50.2 | 14 |
| Offline Reinforcement Learning | hopper medium v2 | Normalized Score | 86.6 | 14 |