Reinforcement Learning with Action Chunking
About
We present Q-chunking, a simple yet effective recipe for improving reinforcement learning (RL) algorithms on long-horizon, sparse-reward tasks. Our recipe is designed for the offline-to-online RL setting, where the goal is to leverage an offline prior dataset to maximize the sample efficiency of online learning. Effective exploration and sample-efficient learning remain central challenges in this setting, as it is not obvious how the offline data should be utilized to acquire a good exploratory policy. Our key insight is that action chunking, a technique popularized in imitation learning where sequences of future actions are predicted rather than a single action at each timestep, can be applied to temporal difference (TD)-based RL methods to mitigate the exploration challenge. Q-chunking adopts action chunking by directly running RL in a 'chunked' action space, enabling the agent to (1) leverage temporally consistent behaviors from offline data for more effective online exploration and (2) use unbiased $n$-step backups for more stable and efficient TD learning. Our experimental results demonstrate that Q-chunking exhibits strong offline performance and online sample efficiency, outperforming prior best offline-to-online methods on a range of long-horizon, sparse-reward manipulation tasks.
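The unbiased $n$-step backup mentioned above can be sketched concretely: because the agent commits to an entire chunk of $h$ actions, the $h$ within-chunk rewards are generated on-policy, so they can be summed into the TD target without any off-policy correction. The helper below is a minimal illustration under assumed names (not code from the paper):

```python
def chunked_td_target(rewards, bootstrap_q, gamma, done):
    """h-step TD target for a Q-function defined over action chunks.

    rewards:     the h per-step rewards collected while executing one chunk
    bootstrap_q: Q(s_{t+h}, a_{t+h:t+2h}) from the target network (assumed)
    gamma:       per-step discount factor
    done:        whether the episode terminated within/at the end of the chunk

    Since the whole chunk is executed as committed, summing its rewards
    yields an unbiased n-step return with n = h.
    """
    h = len(rewards)
    target = sum((gamma ** t) * rewards[t] for t in range(h))
    # Bootstrap from the next chunk's Q-value unless the episode ended.
    target += (gamma ** h) * (0.0 if done else bootstrap_q)
    return target
```

For example, with a sparse reward arriving at the first step of a 3-step chunk, `chunked_td_target([1.0, 0.0, 0.0], 10.0, 0.9, False)` returns `1 + 0.9**3 * 10 = 8.29`.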
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Robotic Manipulation | Robomimic Can | Success Rate | 94 | 12 |
| Robotic Manipulation | Robomimic Lift | Success Rate | 100 | 12 |
| Robotic Manipulation | Robomimic Square | Success Rate | 92 | 12 |
| Offline Goal-Conditioned Reinforcement Learning | humanoidmaze giant | Success Rate | 4800 | 10 |
| Offline Goal-Conditioned Reinforcement Learning | puzzle 4x5 | Success Rate | 2000 | 10 |
| Offline Goal-Conditioned Reinforcement Learning | puzzle-4x6-1B | Success Rate | 2800 | 10 |
| Offline Goal-Conditioned Reinforcement Learning | cube-quadruple 100M | Success Rate | 35 | 10 |
| Offline Goal-Conditioned Reinforcement Learning | cube-triple 100M | Success Rate | 20 | 10 |
| Offline Goal-Conditioned Reinforcement Learning | cube-octuple-1B | Success Rate | 0 | 10 |
| Language-guided robot manipulation | LIBERO-Spatial 5-shot (test) | Success Rate | 46 | 5 |