Uni-NaVid: A Video-based Vision-Language-Action Model for Unifying Embodied Navigation Tasks

About

A practical navigation agent must be capable of handling a wide range of interaction demands, such as following instructions, searching for objects, answering questions, tracking people, and more. Existing models for embodied navigation fall short of serving as practical generalists in the real world, as they are often constrained by specific task configurations or pre-defined maps with discretized waypoints. In this work, we present Uni-NaVid, the first video-based vision-language-action (VLA) model designed to unify diverse embodied navigation tasks and enable seamless navigation for mixed long-horizon tasks in unseen real-world environments. Uni-NaVid achieves this by harmonizing the input and output data configurations for all commonly used embodied navigation tasks, thereby integrating all tasks in one model. To train Uni-NaVid, we collect 3.6 million navigation data samples in total from four essential navigation sub-tasks and foster synergy in learning across them. Extensive experiments on comprehensive navigation benchmarks demonstrate the advantages of unified modeling in Uni-NaVid and show that it achieves state-of-the-art performance. Additionally, real-world experiments confirm the model's effectiveness and efficiency, shedding light on its strong generalizability.

Jiazhao Zhang, Kunyu Wang, Shaoan Wang, Minghan Li, Haoran Liu, Songlin Wei, Zhongyuan Wang, Zhizheng Zhang, He Wang • 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Vision-Language Navigation | R2R-CE (val-unseen) | Success Rate (SR) | 47 | 266 |
| Vision-Language Navigation | RxR-CE (val-unseen) | SR | 48.7 | 172 |
| Vision-and-Language Navigation | R2R-CE (val-seen) | SR | 58 | 49 |
| Object Goal Navigation | HM3D-OVON Seen (val) | SR | 41.3 | 44 |
| Object Goal Navigation | HM3D-OVON unseen (val) | SR | 39.5 | 43 |
| Object Goal Navigation | HM3D-OVON Seen-Synonyms (val) | SR | 43.9 | 35 |
| Object Goal Navigation | HM3D v1 (val) | SR | 73.7 | 34 |
| Object Navigation | HM3D v1 (val) | SR | 73.7 | 32 |
| Open-set ObjectGoal Navigation | HM3D-OVON unseen (val) | SR | 39.5 | 28 |
| Object Goal Navigation | HM3D (val) | SR | 73.7 | 21 |
Showing 10 of 31 rows
