V-CAGE: Vision-Closed-Loop Agentic Generation Engine for Robotic Manipulation
About
Scaling Vision-Language-Action (VLA) models requires massive datasets that are both semantically coherent and physically feasible. However, existing scene generation methods often lack context-awareness, which makes it difficult to synthesize high-fidelity environments with rich semantic information and frequently produces unreachable target positions that cause tasks to fail prematurely. We present V-CAGE (Vision-Closed-loop Agentic Generation Engine), an agentic framework for autonomous robotic data synthesis. Unlike traditional scripted pipelines, V-CAGE operates as an embodied agentic system, leveraging foundation models to bridge high-level semantic reasoning with low-level physical interaction. Specifically, we introduce Inpainting-Guided Scene Construction to systematically arrange context-aware layouts, ensuring that the generated scenes are both semantically structured and kinematically reachable. To ensure trajectory correctness, we integrate functional metadata with a closed-loop verification mechanism based on a Vision-Language Model (VLM), which acts as a visual critic to filter out silent failures and break the error propagation chain. Finally, to overcome the storage bottleneck of massive video datasets, we implement a perceptually driven compression algorithm that achieves over 90% file size reduction without compromising downstream VLA training efficacy. By centralizing semantic layout planning and visual self-verification, V-CAGE automates the end-to-end pipeline, enabling highly scalable synthesis of diverse, high-quality robotic manipulation datasets.
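The closed-loop verification idea described above can be illustrated with a minimal sketch. This is not the V-CAGE implementation: `Episode`, `synthesize_verified`, `build_dataset`, and the `generate` and `critic` callables are hypothetical names introduced here, the critic is a stand-in for a VLM call, and the retry-on-rejection policy is an assumption (the abstract only states that failed trajectories are filtered out).

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Episode:
    """Hypothetical container for one synthesized rollout."""
    task: str
    attempt: int
    final_frame: str  # stand-in for a rendered image of the final scene


# Both callables are assumptions: `generate` would wrap scene construction
# and trajectory execution; `critic` would wrap a VLM judging success.
Generator = Callable[[str, int], Episode]
Critic = Callable[[Episode], bool]


def synthesize_verified(
    task: str, generate: Generator, critic: Critic, max_attempts: int = 3
) -> Optional[Episode]:
    """Return the first episode the critic accepts, or None if all fail."""
    for attempt in range(max_attempts):
        episode = generate(task, attempt)
        if critic(episode):
            return episode
    return None  # every attempt was rejected; nothing enters the dataset


def build_dataset(
    tasks: List[str], generate: Generator, critic: Critic, max_attempts: int = 3
) -> List[Episode]:
    """Keep only critic-approved episodes, so silent failures are dropped."""
    kept = []
    for task in tasks:
        episode = synthesize_verified(task, generate, critic, max_attempts)
        if episode is not None:
            kept.append(episode)
    return kept
```

The point of the design is that an episode reaches the dataset only after an independent visual check of its outcome, so a trajectory that ran without raising an error but did not achieve the goal is excluded rather than passed on to training.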
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Long-horizon robotic manipulation | AutoCheckout Synthesized | -- | 3 |
| Long-horizon robotic manipulation | PackBreads Synthesized | -- | 3 |
| Long-horizon robotic manipulation | PackStationery Synthesized | -- | 3 |
| Long-horizon robotic manipulation | SortToCabinet Synthesized | -- | 3 |