From Flow to One Step: Real-Time Multi-Modal Trajectory Policies via Implicit Maximum Likelihood Estimation-based Distribution Distillation
About
Generative policies based on diffusion and flow matching achieve strong performance in robotic manipulation by modeling multi-modal human demonstrations. However, their reliance on iterative Ordinary Differential Equation (ODE) integration introduces substantial latency, limiting high-frequency closed-loop control. Recent single-step acceleration methods alleviate this overhead but often exhibit distributional collapse, producing averaged trajectories that fail to execute coherent manipulation strategies. We propose a framework that distills a Conditional Flow Matching (CFM) expert into a fast single-step student via Implicit Maximum Likelihood Estimation (IMLE). A bi-directional Chamfer distance provides a set-level objective that promotes both mode coverage and fidelity, preserving the teacher's multi-modal action distribution in a single forward pass. A unified perception encoder further integrates multi-view RGB, depth, point clouds, and proprioception into a geometry-aware representation. The resulting high-frequency control supports real-time receding-horizon re-planning and improves robustness under dynamic disturbances.
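To illustrate why a bi-directional Chamfer distance discourages collapse to an averaged trajectory, here is a minimal NumPy sketch. It is not the paper's implementation: the function name `bidirectional_chamfer`, the squared Euclidean metric over flattened trajectories, and the equal weighting of the two directions are all assumptions made for illustration.

```python
import numpy as np

def bidirectional_chamfer(student, teacher):
    """Set-level distance between two sets of flattened trajectories.

    student: (M, D) array of student samples (hypothetical layout).
    teacher: (N, D) array of teacher samples.
    Assumes squared Euclidean distance and equal weighting of both terms.
    """
    # Pairwise squared distances, shape (M, N).
    diff = student[:, None, :] - teacher[None, :, :]
    d2 = np.sum(diff ** 2, axis=-1)
    # Coverage: every teacher sample should have a nearby student sample.
    coverage = d2.min(axis=0).mean()
    # Fidelity: every student sample should lie near some teacher sample.
    fidelity = d2.min(axis=1).mean()
    return coverage + fidelity, coverage, fidelity

# Toy two-mode teacher: one trajectory goes "left", one goes "right".
teacher = np.array([[-1.0, 0.0], [1.0, 0.0]])
collapsed = np.zeros((2, 2))   # student that averages the two modes
covering = teacher.copy()      # student that reproduces both modes

loss_collapsed, _, _ = bidirectional_chamfer(collapsed, teacher)
loss_covering, _, _ = bidirectional_chamfer(covering, teacher)
print(loss_collapsed, loss_covering)  # collapsed student scores 2.0, covering student 0.0
```

In this toy case the averaged student is penalized by both terms, while the student that reproduces each mode reaches zero loss, which is the behavior the set-level objective is meant to encourage.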
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Robot Manipulation | RLBench | Success Rate (Unplug Charger) | 84.1 | 12 |
| Robotic Manipulation | Real-robot manipulation tasks Aggregate | Average Success Rate (Avg SR) | 70 | 9 |
| Cabinet Opening | Real-robot manipulation Dynamic Tasks | Success Rate | 50 | 4 |
| Grasping | Real-robot manipulation Dynamic Tasks | Success Rate | 66.7 | 4 |
| Cube Stowing | Real-robot manipulation Dynamic Tasks | Success Rate | 63.3 | 4 |
| Kitchen Cleanup | Real-robot manipulation Static Tasks | Success Rate | 86.7 | 4 |
| Microwave Loading | Real-robot manipulation Static Tasks | Success Rate | 83.3 | 4 |