Training-Time Action Conditioning for Efficient Real-Time Chunking

About

Real-time chunking (RTC) enables vision-language-action models (VLAs) to generate smooth, reactive robot trajectories by asynchronously predicting action chunks and conditioning on previously committed actions via inference-time inpainting. However, this inpainting method introduces computational overhead that increases inference latency. In this work, we propose a simple alternative: simulating inference delay at training time and conditioning on action prefixes directly, eliminating any inference-time overhead. Our method requires no modifications to the model architecture or robot runtime, and can be implemented with only a few additional lines of code. In simulated experiments, we find that training-time RTC outperforms inference-time RTC at higher inference delays. In real-world experiments on box building and espresso making tasks with the $\pi_{0.6}$ VLA, we demonstrate that training-time RTC maintains both task performance and speed parity with inference-time RTC while being computationally cheaper. Our results suggest that training-time action conditioning is a practical drop-in replacement for inference-time inpainting in real-time robot control.
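The training-time idea in the abstract can be illustrated with a small sketch. This is not the authors' code. It assumes a flow-matching action head with the interpolation x_t = t·a + (1 − t)·ε, where t = 1 means a clean action. A delay d is sampled per example. The first d actions of the chunk are fed in clean, standing in for actions already committed, and they are excluded from the loss. The remaining actions are noised as usual. All names (`simulate_delay`, `masked_mse`, `committed_prefix`, `TrainingExample`) are hypothetical, and actions are plain lists of floats to keep the example dependency-free.

```python
import random
from dataclasses import dataclass
from typing import List

Action = List[float]


@dataclass
class TrainingExample:
    noisy_actions: List[Action]     # model input at each chunk position
    timesteps: List[float]          # per-position flow time (1.0 = clean prefix)
    velocity_targets: List[Action]  # flow-matching target a - eps (unused on prefix)
    loss_mask: List[bool]           # True where the denoising loss applies
    delay: int                      # simulated inference delay d


def simulate_delay(chunk: List[Action], max_delay: int,
                   rng: random.Random) -> TrainingExample:
    """Hypothetical sketch: simulate an inference delay at training time."""
    horizon = len(chunk)
    if not 0 <= max_delay < horizon:
        raise ValueError("max_delay must be in [0, horizon)")
    d = rng.randint(0, max_delay)  # inclusive on both ends
    t = rng.random()               # one flow time shared by the non-prefix part
    noisy, times, targets, mask = [], [], [], []
    for i, action in enumerate(chunk):
        if i < d:
            # Committed prefix: condition on the ground truth, no loss here.
            noisy.append(list(action))
            times.append(1.0)
            targets.append([0.0] * len(action))
            mask.append(False)
        else:
            eps = [rng.gauss(0.0, 1.0) for _ in action]
            noisy.append([t * a + (1.0 - t) * e for a, e in zip(action, eps)])
            times.append(t)
            targets.append([a - e for a, e in zip(action, eps)])
            mask.append(True)
    return TrainingExample(noisy, times, targets, mask, d)


def masked_mse(pred: List[Action], target: List[Action], mask: List[bool]) -> float:
    """Mean squared error over the positions selected by the mask only."""
    total, count = 0.0, 0
    for p, y, m in zip(pred, target, mask):
        if m:
            total += sum((pi - yi) ** 2 for pi, yi in zip(p, y))
            count += len(p)
    if count == 0:
        raise ValueError("mask selects no positions")
    return total / count


def committed_prefix(prev_chunk: List[Action], start: int, delay: int) -> List[Action]:
    """Hypothetical inference-side helper: if inference begins after `start`
    actions of the previous chunk have run and takes `delay` control steps,
    these actions are executed meanwhile and become the clean prefix."""
    if start + delay > len(prev_chunk):
        raise ValueError("previous chunk too short for this start and delay")
    return [list(a) for a in prev_chunk[start:start + delay]]
```

Usage: `simulate_delay([[0.1 * i, -0.1 * i] for i in range(8)], max_delay=3, rng=random.Random(0))` yields an example whose first `delay` inputs equal the ground-truth actions and whose loss mask is `False` on exactly those positions. Under this sketch's assumptions, the only inference-time work is placing `committed_prefix(...)` at the start of the chunk with t = 1, which has no extra cost compared with the inpainting guidance used by inference-time RTC.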

Kevin Black, Allen Z. Ren, Michael Equi, Sergey Levine · 2025

Related benchmarks

Task	Dataset	Metric	Result	
Simulated Robot Action Chunking	Kinetix (full-data)	Overall Return	83.8	6
Real-world Robotic Manipulation (Mean)	Real-world	Mean Overall Success Rate	34.7	5
Grocery Checkout Scanning	Real-world	S1 Cumulative Success Rate	74	5
Put Back Block	Real-world	S1 Cumulative Success Rate	56	5
Press Button	Real-world	S1 Cumulative Success Rate	26	5
Multi-task reinforcement learning	Meta-World MT50	Success Rate (d=0)	80	4
Robot Manipulation	Kinetix	Success Rate (d=0)	89	4
Single-arm sorting	Real-robot single-arm sorting 10 physical trials (test)	Success Rate	0.9	3
