Navigation World Models

About

Navigation is a fundamental skill of agents with visual-motor capabilities. We introduce a Navigation World Model (NWM), a controllable video generation model that predicts future visual observations based on past observations and navigation actions. To capture complex environment dynamics, NWM employs a Conditional Diffusion Transformer (CDiT), trained on a diverse collection of egocentric videos of both human and robotic agents, and scaled up to 1 billion parameters. In familiar environments, NWM can plan navigation trajectories by simulating them and evaluating whether they achieve the desired goal. Unlike supervised navigation policies with fixed behavior, NWM can dynamically incorporate constraints during planning. Experiments demonstrate its effectiveness in planning trajectories from scratch or by ranking trajectories sampled from an external policy. Furthermore, NWM leverages its learned visual priors to imagine trajectories in unfamiliar environments from a single input image, making it a flexible and powerful tool for next-generation navigation systems.

Amir Bar, Gaoyue Zhou, Danny Tran, Trevor Darrell, Yann LeCun • 2024
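
The planning idea in the abstract (simulate candidate trajectories with the world model, check how close each one gets to the goal, and optionally apply constraints at planning time) can be illustrated with a toy sketch. Everything here is an assumption made for illustration, not the authors' implementation: `toy_world_model` is a hypothetical stand-in that treats a 2D position as the "observation", the score is a plain Euclidean distance rather than an image-similarity measure, and the step-length cap is just one example of a constraint.

```python
import numpy as np

def toy_world_model(obs, action):
    """Hypothetical stand-in for a learned world model: next obs = obs + action."""
    return obs + action

def rollout(obs, actions, model=toy_world_model):
    """Simulate a sequence of actions and return the final predicted observation."""
    for action in actions:
        obs = model(obs, action)
    return obs

def score(final_obs, goal_obs):
    """Higher is better: negative Euclidean distance to the goal observation."""
    return -float(np.linalg.norm(final_obs - goal_obs))

def satisfies_constraints(actions, max_step=1.0):
    """Example planning-time constraint (an assumption): cap each step's length."""
    return bool(np.all(np.linalg.norm(actions, axis=1) <= max_step))

def plan_by_ranking(start_obs, goal_obs, candidates):
    """Rank candidate action sequences by simulated outcome.

    Returns (index of best feasible candidate, its score), or (None, -inf)
    if no candidate satisfies the constraints.
    """
    best_idx, best_score = None, -np.inf
    for i, actions in enumerate(candidates):
        if not satisfies_constraints(actions):
            continue  # constraint applied during planning, not baked into a policy
        s = score(rollout(start_obs, actions), goal_obs)
        if s > best_score:
            best_idx, best_score = i, s
    return best_idx, best_score

start = np.zeros(2)
goal = np.array([2.0, 1.0])
# Candidates could come from random sampling or from an external policy.
candidates = [
    np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]),  # ends exactly at the goal
    np.array([[0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]),  # ends far from the goal
    np.array([[2.0, 1.0]]),                          # reaches goal, but step too long
]
best_idx, best_score = plan_by_ranking(start, goal, candidates)
print(best_idx, best_score)
```

In this toy setup the first candidate is selected: the third also reaches the goal but is rejected by the step-length constraint. Swapping in a different `satisfies_constraints` changes the chosen plan without retraining anything, which is the flexibility the abstract contrasts with fixed-behavior supervised policies.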

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Visual generation | 2D trajectory dataset | LPIPS 0.377 | 16 |
| World modeling for ego-centric navigation | RECON (val) | LPIPS 0.26 | 12 |
| Goal Conditioned Visual Navigation | RECON | ATE 1.13 | 11 |
| Visual generation | 3D trajectory dataset | LPIPS 0.376 | 8 |
| Image-Goal Navigation | Image-Goal Navigation | SR 43.33 | 7 |
| Goal Conditioned Visual Navigation | HuRON | ATE 3.68 | 6 |
| Goal Conditioned Visual Navigation | Tartan Drive | ATE 5.63 | 6 |
| Goal Conditioned Visual Navigation | SCAND | ATE 1.28 | 6 |
| Point-Goal navigation | Point-Goal Navigation | SR 52.67 | 6 |
| Language-Goal Navigation | Language-Goal Navigation | Success Rate (SR) 51.33 | 6 |
Showing 10 of 22 rows
