CRAFT: Video Diffusion for Bimanual Robot Data Generation
About
Bimanual robot learning from demonstrations is fundamentally limited by the cost and narrow visual diversity of real-world data, which constrains policy robustness across viewpoints, object configurations, and embodiments. We present Canny-guided Robot Data Generation using Video Diffusion Transformers (CRAFT), a video diffusion-based framework for scalable bimanual demonstration generation that synthesizes temporally coherent manipulation videos while producing action labels. By conditioning video diffusion on edge-based structural cues extracted from simulator-generated trajectories, CRAFT produces physically plausible trajectory variations and supports a unified augmentation pipeline spanning object pose changes, camera viewpoints, lighting and background variations, cross-embodiment transfer, and multi-view synthesis. We leverage a pre-trained video diffusion model to convert simulated videos, along with action labels from the simulation trajectories, into action-consistent demonstrations. Starting from only a few real-world demonstrations, CRAFT generates a large, visually diverse set of photorealistic training data, bypassing the need to replay demonstrations on the real robot (Sim2Real). Across simulated and real-world bimanual tasks, CRAFT improves success rates over existing augmentation strategies and straightforward data scaling, demonstrating that diffusion-based video generation can substantially expand demonstration diversity and improve generalization for dual-arm manipulation tasks. Our project website is available at: https://craftaug.github.io/
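The conditioning step described in the abstract, edge maps taken from simulator frames and kept aligned with the simulator's action labels, can be illustrated roughly as follows. This is a minimal sketch, not the authors' pipeline: `edge_map` and `build_conditioning` are hypothetical names, a thresholded gradient magnitude stands in for a full Canny detector (no smoothing, non-maximum suppression, or hysteresis), and the frame size and action dimension are arbitrary assumptions.

```python
import numpy as np


def edge_map(frame: np.ndarray, threshold: float = 0.2) -> np.ndarray:
    """Binary edge map of a grayscale frame with values in [0, 1].

    Simplified stand-in for Canny: thresholded gradient magnitude only.
    """
    # np.gradient returns derivatives along axis 0 (rows) and axis 1 (columns).
    gy, gx = np.gradient(frame.astype(np.float64))
    magnitude = np.hypot(gx, gy)
    return (magnitude > threshold).astype(np.uint8)


def build_conditioning(sim_frames: np.ndarray, sim_actions: np.ndarray):
    """Pair each simulated frame's edge map with the action of the same timestep.

    sim_frames:  (T, H, W) grayscale simulator video (hypothetical layout).
    sim_actions: (T, D) action labels from the simulated trajectory.
    Returns (edges, actions) with matching leading dimension T.
    """
    if len(sim_frames) != len(sim_actions):
        raise ValueError("frames and actions must have the same number of timesteps")
    edges = np.stack([edge_map(frame) for frame in sim_frames])
    return edges, np.asarray(sim_actions)


# Toy "simulated" clip: a bright square sliding one pixel to the right per frame.
T, H, W, D = 4, 16, 16, 6  # D is an arbitrary, assumed action dimension
frames = np.zeros((T, H, W))
for t in range(T):
    frames[t, 4:8, 4 + t:8 + t] = 1.0
actions = np.arange(T * D, dtype=np.float64).reshape(T, D)

edges, labels = build_conditioning(frames, actions)
print(edges.shape, labels.shape)
```

In this sketch the edge video would serve as the structural conditioning signal for a video diffusion model, while `labels` passes through unchanged so that each generated frame keeps the action of the simulated timestep it came from.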
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Lift Pot | RoboTwin Simulation UR5 to Franka Panda 14 | Success Rate | 82.6 | 4 |
| Lift Roller | Real-World Augmentation Lighting | Success Rate | 17 | 4 |
| Lift Roller | Real-World Augmentation Background | Success Rate | 18 | 4 |
| Lift Roller | Real-World Augmentation Camera View | Success Rate | 19 | 4 |
| Lift Roller | Real-World Augmentation Object Color | Success Rate | 90 | 4 |
| LR | Real-World xArm7 to Franka Panda | Success Rate | 0.85 | 4 |
| Place Cans | Real-World Augmentation Lighting | Success Rate | 70 | 4 |
| Place Cans | Real-World Augmentation Background | Success Rate | 15 | 4 |
| Place Cans | Real-World Augmentation Camera View | Success Rate | 18 | 4 |
| Place Cans | Real-World Augmentation Object Color | Success Rate | 90 | 4 |