Closed-Loop Visuomotor Control with Generative Expectation for Robotic Manipulation
About
Despite significant progress in robotics and embodied AI in recent years, deploying robots for long-horizon tasks remains a great challenge. Majority of prior arts adhere to an open-loop philosophy and lack real-time feedback, leading to error accumulation and undesirable robustness. A handful of approaches have endeavored to establish feedback mechanisms leveraging pixel-level differences or pre-trained visual representations, yet their efficacy and adaptability have been found to be constrained. Inspired by classic closed-loop control systems, we propose CLOVER, a closed-loop visuomotor control framework that incorporates feedback mechanisms to improve adaptive robotic control. CLOVER consists of a text-conditioned video diffusion model for generating visual plans as reference inputs, a measurable embedding space for accurate error quantification, and a feedback-driven controller that refines actions from feedback and initiates replans as needed. Our framework exhibits notable advancement in real-world robotic tasks and achieves state-of-the-art on CALVIN benchmark, improving by 8% over previous open-loop counterparts. Code and checkpoints are maintained at https://github.com/OpenDriveLab/CLOVER.
Related benchmarks
| Task | Dataset | Result | Rank | |
|---|---|---|---|---|
| Long-horizon robot manipulation | Calvin ABCD→D | Task 1 Completion Rate96 | 96 | |
| Instruction-following robotic manipulation | CALVIN ABC→D (unseen environment D) | Success Rate (Length 1)96 | 29 | |
| Long-Horizon Multi-Task Language Control | CALVIN ABC→D (test) | Seq Success (1)96 | 13 | |
| Language-conditioned visuomotor control | CALVIN ABC→D (Zero-shot) | Completion Rate (Seq 1)96 | 8 | |
| Long-horizon robotic manipulation | AIRBOT Play real-world | Sub-task 1 Success Rate93.3 | 4 | |
| Stack two bowls | AIRBOT Play real-world | Success Rate86.7 | 4 | |
| Pour shrimp into plate | AIRBOT Play real-world | Success Rate80 | 4 |