Navigation World Models

About

Navigation is a fundamental skill of agents with visual-motor capabilities. We introduce a Navigation World Model (NWM), a controllable video generation model that predicts future visual observations based on past observations and navigation actions. To capture complex environment dynamics, NWM employs a Conditional Diffusion Transformer (CDiT), trained on a diverse collection of egocentric videos of both human and robotic agents, and scaled up to 1 billion parameters. In familiar environments, NWM can plan navigation trajectories by simulating them and evaluating whether they achieve the desired goal. Unlike supervised navigation policies with fixed behavior, NWM can dynamically incorporate constraints during planning. Experiments demonstrate its effectiveness in planning trajectories from scratch or by ranking trajectories sampled from an external policy. Furthermore, NWM leverages its learned visual priors to imagine trajectories in unfamiliar environments from a single input image, making it a flexible and powerful tool for next-generation navigation systems.
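
The planning idea in the abstract (simulate candidate trajectories, score how well each reaches the goal, optionally apply constraints) can be illustrated with a minimal sketch. This is not the NWM implementation. The names `toy_world_model`, `rollout`, `goal_distance`, and `rank_trajectories` are hypothetical, the "observation" is a 2D pose rather than an image, and the hand-written kinematic update stands in for the learned video model. The Euclidean cost is likewise only a placeholder for whatever goal-similarity score a real system would compute on predicted frames.

```python
import math
from typing import Callable, List, Sequence, Tuple

# Hypothetical stand-ins: a real world model predicts images, not poses.
State = Tuple[float, float, float]   # (x, y, heading in radians)
Action = Tuple[float, float]         # (forward distance, change in heading)


def toy_world_model(state: State, action: Action) -> State:
    """Placeholder for a learned model: predict the next state from an action."""
    x, y, theta = state
    forward, dtheta = action
    theta = theta + dtheta
    return (x + forward * math.cos(theta), y + forward * math.sin(theta), theta)


def rollout(model: Callable[[State, Action], State],
            start: State, actions: Sequence[Action]) -> State:
    """Simulate an action sequence and return the predicted final state."""
    state = start
    for action in actions:
        state = model(state, action)
    return state


def goal_distance(state: State, goal: State) -> float:
    """Toy cost: planar distance between predicted and goal positions."""
    return math.hypot(state[0] - goal[0], state[1] - goal[1])


def rank_trajectories(model, start: State, goal: State,
                      candidates: Sequence[Sequence[Action]],
                      constraint: Callable[[Sequence[Action]], bool] = lambda a: True,
                      ) -> List[Tuple[float, Sequence[Action]]]:
    """Score every candidate that satisfies the constraint; lowest cost first."""
    scored = [(goal_distance(rollout(model, start, c), goal), c)
              for c in candidates if constraint(c)]
    return sorted(scored, key=lambda pair: pair[0])


start = (0.0, 0.0, 0.0)
goal = (2.0, 0.0, 0.0)
straight = [(1.0, 0.0), (1.0, 0.0)]
turn_left = [(1.0, math.pi / 2), (1.0, 0.0)]
stay = [(0.0, 0.0), (0.0, 0.0)]

ranked = rank_trajectories(toy_world_model, start, goal, [turn_left, stay, straight])
best_cost, best_plan = ranked[0]

# A constraint supplied at planning time, here "never move forward".
no_forward = rank_trajectories(
    toy_world_model, start, goal, [turn_left, stay, straight],
    constraint=lambda actions: all(forward == 0.0 for forward, _ in actions),
)
```

In this toy setup the candidates could come from random sampling or from an external policy. The constraint is passed as an argument at planning time, so changing it does not require retraining anything.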

Amir Bar, Gaoyue Zhou, Danny Tran, Trevor Darrell, Yann LeCun • 2024

Related benchmarks

Task	Dataset	Metric	Value
Goal Conditioned Visual Navigation	SCAND	ATE	1.28	18
Goal Conditioned Visual Navigation	RECON	ATE	1.13	16
Visual generation	2D trajectory dataset	LPIPS	0.377	16
Geometric Drift Evaluation	HuRON	Euclidean Distance (ED)	8.99	15
Geometric Drift Evaluation	TartanDrive	Endpoint Distance (ED)	6.41	15
Perceptual Drift	RECON	LPIPS	0.33	15
Perceptual Drift	SCAND	LPIPS	0.353	15
Perceptual Drift	TartanDrive	LPIPS	0.381	15
Geometric Drift Evaluation	RECON	Euclidean Distance (ED)	9.4	15
Perceptual Drift	HuRON	LPIPS	0.445	15

Showing 10 of 104 rows
