
OmniVLA: An Omni-Modal Vision-Language-Action Model for Robot Navigation

About

Humans can flexibly interpret and compose different goal specifications, such as language instructions, spatial coordinates, or visual references, when navigating to a destination. In contrast, most existing robotic navigation policies are trained on a single modality, limiting their adaptability to real-world scenarios where different forms of goal specification are natural and complementary. In this work, we present a training framework for robotic foundation models that enables omni-modal goal conditioning for vision-based navigation. Our approach leverages a high-capacity vision-language-action (VLA) backbone and trains with three primary goal modalities: 2D poses, egocentric images, and natural language, as well as their combinations, through a randomized modality fusion strategy. This design not only expands the pool of usable datasets but also encourages the policy to develop richer geometric, semantic, and visual representations. The resulting model, OmniVLA, achieves strong generalization to unseen environments, robustness to scarce modalities, and the ability to follow novel natural language instructions. We demonstrate that OmniVLA outperforms specialist baselines across modalities and offers a flexible foundation for fine-tuning to new modalities and tasks. We believe OmniVLA provides a step toward broadly generalizable and flexible navigation policies, and a scalable path for building omni-modal robotic foundation models. We present videos showcasing OmniVLA performance and will release its checkpoints and training code on our project page.
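To make the randomized modality fusion idea concrete, below is a minimal sketch (not the authors' released code) of how goal modalities might be encoded and randomly dropped during training. The encoder modules, feature dimension, drop probability, learned "absent modality" tokens, and the simple average fusion are all illustrative assumptions, not details from the paper.

```python
# Hypothetical sketch of randomized modality fusion for omni-modal goal
# conditioning. All module choices and hyperparameters are assumptions.
import random
import torch
import torch.nn as nn


class OmniGoalEncoder(nn.Module):
    """Encodes whichever goal modalities are present (2D pose, goal image,
    language embedding) and fuses them into one conditioning vector.
    During training, available modalities are randomly dropped so the
    policy learns to act from any subset of goal specifications."""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.pose_enc = nn.Linear(3, dim)                 # (x, y, yaw) goal pose
        self.image_enc = nn.Sequential(                   # tiny stand-in for an image encoder
            nn.Flatten(), nn.LazyLinear(dim)
        )
        self.text_enc = nn.LazyLinear(dim)                # stand-in for a language-model embedding
        self.missing = nn.ParameterDict({                 # learned tokens for absent modalities
            k: nn.Parameter(torch.zeros(dim)) for k in ("pose", "image", "text")
        })

    def forward(self, pose=None, image=None, text_emb=None, drop_prob=0.3):
        feats = []
        for name, value, enc in (
            ("pose", pose, self.pose_enc),
            ("image", image, self.image_enc),
            ("text", text_emb, self.text_enc),
        ):
            # Randomly mask out available modalities at training time so
            # every combination of goal specifications is seen by the policy.
            if value is None or (self.training and random.random() < drop_prob):
                feats.append(self.missing[name])
            else:
                feats.append(enc(value).squeeze(0))
        return torch.stack(feats).mean(dim=0)             # simple average fusion


# Usage: at inference, any subset of goals can be supplied, e.g. language only.
encoder = OmniGoalEncoder().eval()
goal = encoder(text_emb=torch.randn(1, 768))
```

In this sketch the fused goal vector would then condition the VLA backbone's action head; the point is only to illustrate how random masking lets one policy consume 2D poses, goal images, language, or their combinations.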

Noriaki Hirose, Catherine Glossop, Dhruv Shah, Sergey Levine • 2025

Related benchmarks

| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Image-Goal Navigation | Image-Goal Navigation | SR | 36.67 | 7 |
| Language-Goal Navigation | Language-Goal Navigation | Success Rate (SR) | 62.67 | 6 |
| Point-Goal Navigation | Point-Goal Navigation | SR | 40 | 6 |
| Robot Navigation | DTEL (test) | Collision Rate (%) | 12.6 | 6 |
| Language-conditioned Navigation | Real-world Robot Navigation Environment OOD Language Prompts | Language Following | 83 | 5 |
| 2D Pose-conditioned Navigation | Real-world Robot Navigation Environment | Success Rate | 45 | 5 |
| Robot Navigation | DAUG (test) | Collision Rate | 14.2 | 5 |
| Image-Goal Navigation | Real-world (Mobile Robot Platform) (test) | Success Rate | 0.52 | 4 |
| Point-Goal Navigation | Real-world (Mobile Robot Platform) (test) | Success Rate | 76 | 3 |
| Language-Goal Navigation | Real-world (Mobile Robot Platform) (test) | Success Rate | 56 | 3 |
