Online Self-Training for Co-Adaptation in Hierarchical Diffusion Policies

About

Hierarchical policies decompose language-conditioned, long-horizon robotic manipulation into a high-level (HL) planner and a low-level (LL) controller. Effective coordination between the two, however, requires that both operate on compatible subgoal distributions. We propose ORCHID, a self-training framework that enables stable online improvement of hierarchical diffusion policies by aligning planning and control through iterative refinement. ORCHID filters policy samples using environment feedback to identify trajectories on which the planner and controller are jointly successful, then distills those trajectories back into both modules via supervised learning. This process induces a bidirectional co-adaptation: the planner grounds its subgoals in what the controller can actually reach, while the controller specializes in the trajectory structures the planner produces. Because it relies on supervised distillation of filtered on-policy samples, ORCHID avoids the instability typical of online gradient-based RL training of hierarchical diffusion models. On the CALVIN benchmark, ORCHID allows a lightweight, initially weak model to outperform purely offline methods, including a Vision-Language-Action model twice its size.
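The filter-then-distill loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Episode` container and the `rollout`, `fit_planner`, and `fit_controller` callables are hypothetical placeholders for the environment interaction and the supervised updates of the two diffusion modules, whose details the abstract does not give.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class Episode:
    """One on-policy rollout (hypothetical container, not from the paper)."""
    instruction: str
    subgoals: list   # what the high-level planner proposed
    actions: list    # what the low-level controller executed
    success: bool    # binary environment feedback


def collect_successes(
    rollout: Callable[[str], Episode],
    instructions: Sequence[str],
    samples_per_instruction: int,
) -> List[Episode]:
    """Sample episodes from the current policy and keep only the successful ones."""
    kept = []
    for instruction in instructions:
        for _ in range(samples_per_instruction):
            episode = rollout(instruction)
            if episode.success:  # planner and controller succeeded jointly
                kept.append(episode)
    return kept


def self_training_round(
    rollout: Callable[[str], Episode],
    fit_planner: Callable[[List[Tuple[str, list]]], None],
    fit_controller: Callable[[List[Tuple[list, list]]], None],
    instructions: Sequence[str],
    samples_per_instruction: int = 8,
) -> int:
    """Run one filter-then-distill round; return the number of episodes kept."""
    successes = collect_successes(rollout, instructions, samples_per_instruction)
    if successes:
        # Planner target: subgoals that the controller actually managed to reach.
        fit_planner([(ep.instruction, ep.subgoals) for ep in successes])
        # Controller target: actions that realized the planner's own subgoals.
        fit_controller([(ep.subgoals, ep.actions) for ep in successes])
    return len(successes)
```

Calling `self_training_round` repeatedly with the updated modules would give the iterative refinement; both updates are ordinary supervised fits on filtered data, so no reward gradient passes through either module in this sketch.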

Clemence Grislain, Mathilde Kappel, Olivier Sigaud, Mohamed Chetouani • 2026

Related benchmarks

Task	Dataset	Result
Robotic Manipulation	CALVIN D->D	Average Length: 4.28	40
Language-conditioned manipulation	CALVIN MTLC	Success Rate: 95	12
Language-conditioned manipulation	CALVIN LH-MTLC	Success Rate (1 Instruction): 97.5	10
