Beyond Viewpoint Generalization: What Multi-View Demonstrations Offer and How to Synthesize Them for Robot Manipulation?
About
Does multi-view demonstration truly improve robot manipulation, or merely enhance cross-view robustness? We present a systematic study quantifying the performance gains, scaling behavior, and underlying mechanisms of multi-view data for robot manipulation. Controlled experiments show that, under both fixed and randomized backgrounds, multi-view demonstrations consistently improve single-view policy success and generalization. Performance varies non-monotonically with view coverage, revealing effective regimes rather than a simple "more is better" trend. Notably, multi-view data breaks the scaling limitation of single-view datasets and continues to raise performance ceilings after saturation. Mechanistic analysis shows that multi-view learning promotes manipulation-relevant visual representations, better aligns the action head with the learned feature distribution, and reduces overfitting. Motivated by the importance of multi-view data, its scarcity in large-scale robotic datasets, and the difficulty of collecting additional viewpoints in real-world settings, we propose RoboNVS, a geometry-aware self-supervised framework that synthesizes novel-view videos from monocular inputs. The generated data consistently improves downstream policies in both simulation and real-world environments.
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Novel View Synthesis | Droid, BridgeData V2, and RoboCoin (test) | PSNR | 14.07 | 7 |
| Click Alarmclock | RoboTwin 0° viewpoint | Success Rate | 80 | 6 |
| Click Bell | RoboTwin 0° viewpoint | Success Rate | 38 | 6 |
| Generative Video Synthesis | RoboTwin | PSNR (dB) | 18.839 | 5 |
| Bell Pushing | Real-world setup Task 1 | Success Rate | 14 | 4 |
| Fruit Pick-and-Place | Real-world setup Task 2 | Success Rate | 60 | 4 |
| Lego Pick-and-Place | Real-world setup Task 3 | Success Rate | 13 | 4 |
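
For readers unfamiliar with the PSNR metric reported in the table, the following is a minimal sketch of the standard definition, PSNR = 10 · log10(MAX² / MSE), applied to two flattened grayscale frames. It is not the evaluation code used for these benchmarks: the function name `psnr`, the 8-bit peak value of 255, and the toy pixel values are all assumptions made for illustration.

```python
import math


def psnr(reference, generated, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences.

    Illustrative helper (hypothetical name); assumes 8-bit pixels by default.
    """
    if len(reference) != len(generated) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((r - g) ** 2 for r, g in zip(reference, generated)) / len(reference)
    if mse == 0:
        return math.inf  # identical frames: PSNR is unbounded
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy example: flattened 2x2 grayscale frames (made-up values).
reference = [100, 120, 140, 160]
generated = [110, 110, 150, 150]
score = psnr(reference, generated)  # MSE is 100, giving roughly 28.13 dB
```

Higher values indicate a synthesized frame that is closer to the reference. For video, a common convention is to average the per-frame score, though the exact protocol behind the numbers above is not stated here.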