# Learning to Act Robustly with View-Invariant Latent Actions

## About
Vision-based robotic policies often struggle with even minor viewpoint changes, underscoring the need for view-invariant visual representations. This challenge becomes more pronounced in real-world settings, where viewpoint variability is unavoidable and can significantly disrupt policy performance. Existing methods typically learn invariance from multi-view observations at the scene level, but such approaches rely on visual appearance and fail to incorporate the physical dynamics essential for robust generalization. We propose View-Invariant Latent Action (VILA), which models a latent action capturing transition patterns across trajectories to learn view-invariant representations grounded in physical dynamics. VILA aligns these latent actions across viewpoints using an action-guided objective based on ground-truth action sequences. Experiments in both simulation and the real world show that VILA-based policies generalize effectively to unseen viewpoints and transfer well to new tasks, establishing VILA as a strong pretraining framework that improves robustness and downstream learning performance.
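To make the idea concrete, here is a minimal sketch of an action-guided cross-view alignment objective of the kind described above. All names (`encode_latent_action`, `vila_alignment_loss`), the linear encoder/decoder, and the plain MSE terms are illustrative assumptions, not the paper's actual architecture: a transition observed from two viewpoints is encoded into latent actions, which are pulled together across views and regressed onto the ground-truth action.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_latent_action(obs_t, obs_t1, W):
    """Hypothetical encoder: map an observation transition (o_t, o_{t+1})
    to a latent action. A stand-in for the paper's learned encoder."""
    return np.tanh(W @ np.concatenate([obs_t, obs_t1]))

def vila_alignment_loss(z_view_a, z_view_b, action, V):
    """Sketch of an action-guided alignment objective (assumed form):
    latent actions from two viewpoints of the SAME transition should
    agree, and both should decode to the ground-truth action."""
    # Cross-view consistency term.
    align = np.mean((z_view_a - z_view_b) ** 2)
    # Action-guided term: decode each latent action and compare with
    # the ground-truth action sequence element.
    guided = (np.mean((V @ z_view_a - action) ** 2)
              + np.mean((V @ z_view_b - action) ** 2))
    return align + guided

obs_dim, latent_dim, act_dim = 8, 4, 3
W = 0.1 * rng.normal(size=(latent_dim, 2 * obs_dim))  # shared encoder weights
V = 0.1 * rng.normal(size=(act_dim, latent_dim))      # latent-to-action decoder

# The same physical transition rendered from two camera viewpoints (random
# placeholders here; in practice these come from multi-view trajectories).
obs_a_t, obs_a_t1 = rng.normal(size=obs_dim), rng.normal(size=obs_dim)
obs_b_t, obs_b_t1 = rng.normal(size=obs_dim), rng.normal(size=obs_dim)
action = rng.normal(size=act_dim)  # ground-truth robot action

z_a = encode_latent_action(obs_a_t, obs_a_t1, W)
z_b = encode_latent_action(obs_b_t, obs_b_t1, W)
loss = vila_alignment_loss(z_a, z_b, action, V)
print(float(loss))
```

Minimizing a loss of this shape encourages the latent action to depend on the underlying transition dynamics rather than on view-specific appearance, which is the property the abstract attributes to VILA's pretraining.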
## Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Coffee | Robosuite (seen views) | Success Rate (%) | 63 | 9 |
| Coffee | Robosuite (unseen views) | Success Rate (%) | 12.65 | 9 |
| Lift | Robosuite (seen views) | Success Rate (%) | 99.5 | 9 |
| Lift | Robosuite (unseen views) | Success Rate (%) | 94.7 | 9 |
| Mug Cleanup | Robosuite (seen views) | Success Rate (%) | 56.75 | 9 |
| Mug Cleanup | Robosuite (unseen views) | Success Rate (%) | 27.85 | 9 |
| Square | Robosuite (seen views) | Success Rate (%) | 69 | 9 |
| Square | Robosuite (unseen views) | Success Rate (%) | 19.8 | 9 |
| Stack Three | Robosuite (seen views) | Success Rate (%) | 69 | 9 |
| Stack Three | Robosuite (unseen views) | Success Rate (%) | 53.65 | 9 |