
Learning to Act Robustly with View-Invariant Latent Actions

About

Vision-based robotic policies often struggle with even minor viewpoint changes, underscoring the need for view-invariant visual representations. This challenge becomes more pronounced in real-world settings, where viewpoint variability is unavoidable and can significantly disrupt policy performance. Existing methods typically learn invariance from multi-view observations at the scene level, but such approaches rely on visual appearance and fail to incorporate the physical dynamics essential for robust generalization. We propose View-Invariant Latent Action (VILA), which models a latent action capturing transition patterns across trajectories to learn view-invariant representations grounded in physical dynamics. VILA aligns these latent actions across viewpoints using an action-guided objective based on ground-truth action sequences. Experiments in both simulation and the real world show that VILA-based policies generalize effectively to unseen viewpoints and transfer well to new tasks, establishing VILA as a strong pretraining framework that improves robustness and downstream learning performance.
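To make the two objectives in the abstract concrete, here is a minimal toy sketch of a cross-view latent-action loss: a latent action is encoded from an observation transition, aligned across two viewpoints of the same transition, and guided by the ground-truth action. All names, dimensions, and the linear encoder/decoder are illustrative assumptions, not the actual VILA architecture or training code.

```python
import numpy as np

rng = np.random.default_rng(0)
OBS_DIM, LATENT_DIM, ACTION_DIM = 8, 4, 2

# Hypothetical parameters standing in for learned networks.
W_enc = rng.normal(size=(LATENT_DIM, 2 * OBS_DIM)) * 0.1
W_dec = rng.normal(size=(ACTION_DIM, LATENT_DIM)) * 0.1

def encode_latent_action(obs_t, obs_next):
    """Map an observation transition (o_t, o_{t+1}) to a latent action."""
    return np.tanh(W_enc @ np.concatenate([obs_t, obs_next]))

def vila_style_loss(trans_view_a, trans_view_b, gt_action):
    """Toy objective: cross-view alignment plus action-guided decoding."""
    z_a = encode_latent_action(*trans_view_a)
    z_b = encode_latent_action(*trans_view_b)
    # Latent actions for the same transition should agree across viewpoints.
    align = np.mean((z_a - z_b) ** 2)
    # Both latents should decode to the executed ground-truth action.
    guide = (np.mean((W_dec @ z_a - gt_action) ** 2)
             + np.mean((W_dec @ z_b - gt_action) ** 2))
    return align + guide

# Two camera views of the same transition, plus the executed action.
view_a = (rng.normal(size=OBS_DIM), rng.normal(size=OBS_DIM))
view_b = (rng.normal(size=OBS_DIM), rng.normal(size=OBS_DIM))
action = rng.normal(size=ACTION_DIM)
loss = vila_style_loss(view_a, view_b, action)
print(loss)
```

Note that when both views are identical, the alignment term vanishes and only the action-guided term remains, which is the degenerate single-view case the method is designed to move beyond.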

Youngjoon Jeong, Junha Chun, Taesup Kim · 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Coffee | Robosuite Seen views | Success Rate | 63 | 9 |
| Coffee | Robosuite Unseen views | Success Rate | 12.65 | 9 |
| Lift | Robosuite Seen views | Success Rate | 99.5 | 9 |
| Lift | Robosuite Unseen views | Success Rate | 94.7 | 9 |
| Mug Cleanup | Robosuite Seen views | Success Rate | 56.75 | 9 |
| Mug Cleanup | Robosuite Unseen views | Success Rate | 0.2785 | 9 |
| Square | Robosuite Seen views | Success Rate | 69 | 9 |
| Square | Robosuite Unseen views | Success Rate | 19.8 | 9 |
| Stack Three | Robosuite Seen views | Success Rate | 69 | 9 |
| Stack Three | Robosuite Unseen views | Success Rate | 53.65 | 9 |

Showing 10 of 16 rows
