SANTS: A State-Adaptive Scheduler for World Action Models
About
World Action Models (WAMs) improve robot manipulation by using video-based future representations to condition action generation. In pixel-space WAMs, however, the best action condition is not necessarily the fully denoised video. Controlled denoising-depth scans show that video refinement can reduce action error up to a state-dependent point, after which the gain may saturate or even reverse when late predictions become less action-relevant or physically unreliable. This suggests that action generation should use a state-dependent point along the video noise trajectory rather than a fixed terminal denoising depth. We introduce the State-Adaptive Noise Trajectory Scheduler (SANTS), a lightweight scheduler for video-to-action diffusion policies. At each video decision point, SANTS reads the current video-state representation and noise level, then jointly predicts a cumulative stopping hazard and a relative noise-progression ratio. SANTS is post-trained with a path-level reward computed after the frozen action branch generates the final action chunk, so the scheduler is optimized for downstream action quality rather than intermediate video fidelity, while redundant video-state updates are explicitly penalized. Experiments show that SANTS reaches 94.4% overall success on RoboTwin 2.0 and 73.1% average success across seven real-robot tasks, while reducing latency relative to full video denoising by 81.7% and 79.0%, respectively. These results indicate that adaptive selection along the video noise trajectory can preserve the control benefits of WAM-style future reasoning while removing much of its redundant inference cost.
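
The decision loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `run_adaptive_schedule`, `toy_scheduler`, and `toy_denoise_step`, the rule for accumulating the hazard, the stopping threshold, and the multiplicative use of the ratio are all assumptions made for the example. The abstract states only that the scheduler reads the video state and noise level and predicts a cumulative stopping hazard and a relative noise-progression ratio.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

State = List[float]  # hypothetical stand-in for a video-state representation


@dataclass
class Rollout:
    state: State              # video state handed to the action branch
    sigma: float              # noise level at which denoising stopped
    steps: int                # number of video-state updates performed
    cumulative_hazard: float  # accumulated stopping hazard at exit


def run_adaptive_schedule(
    state: State,
    scheduler: Callable[[State, float], Tuple[float, float]],
    denoise_step: Callable[[State, float, float], State],
    sigma_start: float = 1.0,
    sigma_min: float = 0.02,
    stop_threshold: float = 0.5,   # assumed stopping rule, not from the source
    max_steps: int = 50,
) -> Rollout:
    sigma, cumulative, steps = sigma_start, 0.0, 0
    while steps < max_steps and sigma > sigma_min:
        # The scheduler sees the current state and noise level at each decision point.
        hazard, ratio = scheduler(state, sigma)
        ratio = min(max(ratio, 0.0), 1.0)
        # Assumed accumulation rule: survival probabilities multiply,
        # so the cumulative hazard never decreases along the path.
        cumulative = 1.0 - (1.0 - cumulative) * (1.0 - hazard)
        if cumulative >= stop_threshold:
            break  # stop early; the action branch uses the current state
        # Assumed use of the ratio: the next noise level is a fraction of the current one.
        next_sigma = max(sigma_min, sigma * ratio)
        state = denoise_step(state, sigma, next_sigma)
        sigma = next_sigma
        steps += 1
    return Rollout(state, sigma, steps, cumulative)


def toy_scheduler(state: State, sigma: float) -> Tuple[float, float]:
    """Hypothetical scheduler: constant per-step hazard, halve the noise each step."""
    return 0.2, 0.5


def toy_denoise_step(state: State, sigma: float, next_sigma: float) -> State:
    """Hypothetical denoiser: rescales the state by the noise-level change."""
    return [x * next_sigma / sigma for x in state]


result = run_adaptive_schedule([1.0, -1.0], toy_scheduler, toy_denoise_step)
print(result.steps, result.sigma)
```

With the toy components, the loop stops after three video-state updates at a noise level of 0.125, whereas a scheduler that never raises the hazard would run six updates down to `sigma_min`. In the actual method, the state at the stopping point would condition the frozen action branch.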
Related benchmarks
| Task | Dataset | Metric | Value (%) | Rank |
|---|---|---|---|---|
| Robotic Manipulation | RoboTwin 2.0 | Average Success Rate | 94.4 | 100 |
| Backpack Packing | AgileX dual-arm Real-robot | Success Rate | 74 | 3 |
| Charger insertion | AgileX dual-arm Real-robot | Success Rate | 58 | 3 |
| Clothes folding | AgileX dual-arm Real-robot | Success Rate | 62 | 3 |
| Fridge placement | UR10 kitchen Real-robot | Success Rate | 74 | 3 |
| General Robotic Manipulation | Real-robot Tasks Aggregate | Mean Success Rate | 73.1 | 3 |
| Plate transfer | UR10 kitchen Real-robot | Success Rate | 80 | 3 |
| Sock placement | AgileX dual-arm Real-robot | Success Rate | 78 | 3 |
| Fruit sorting | UR10 kitchen Real-robot | Success Rate | 86 | 3 |
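
As a quick consistency check, the aggregate real-robot figure can be reproduced from the seven per-task rows. The snippet below is only an arithmetic sketch using the values listed in the table; the dictionary name `real_robot_success` is chosen for illustration.

```python
# Per-task success rates (%) copied from the real-robot rows of the table.
real_robot_success = {
    "Backpack Packing": 74,
    "Charger insertion": 58,
    "Clothes folding": 62,
    "Fridge placement": 74,
    "Plate transfer": 80,
    "Sock placement": 78,
    "Fruit sorting": 86,
}

mean_success = sum(real_robot_success.values()) / len(real_robot_success)
print(f"{mean_success:.1f}")  # 73.1
```

The unweighted mean of the seven tasks is 512 / 7 ≈ 73.1, which matches the reported aggregate.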