ZSON: Zero-Shot Object-Goal Navigation using Multimodal Goal Embeddings

About

We present a scalable approach for learning open-world object-goal navigation (ObjectNav) -- the task of asking a virtual robot (agent) to find any instance of an object in an unexplored environment (e.g., "find a sink"). Our approach is entirely zero-shot -- i.e., it does not require ObjectNav rewards or demonstrations of any kind. Instead, we train on the image-goal navigation (ImageNav) task, in which agents find the location where a picture (i.e., goal image) was captured. Specifically, we encode goal images into a multimodal, semantic embedding space to enable training semantic-goal navigation (SemanticNav) agents at scale in unannotated 3D environments (e.g., HM3D). After training, SemanticNav agents can be instructed to find objects described in free-form natural language (e.g., "sink", "bathroom sink", etc.) by projecting language goals into the same multimodal, semantic embedding space. As a result, our approach enables open-world ObjectNav. We extensively evaluate our agents on three ObjectNav datasets (Gibson, HM3D, and MP3D) and observe absolute improvements in success of 4.2% - 20.0% over existing zero-shot methods. For reference, these gains are similar to or better than the 5% improvement in success between the Habitat 2020 and 2021 ObjectNav challenge winners. In an open-world setting, we discover that our agents can generalize to compound instructions with a room explicitly mentioned (e.g., "Find a kitchen sink") and when the target room can be inferred (e.g., "Find a sink and a stove").
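The core idea in the abstract, that the policy only ever sees a goal embedding and never the goal's modality, can be illustrated with a minimal sketch. Everything below is a hypothetical stand-in and not the paper's implementation: `toy_embed` fakes a perfectly aligned shared space by hashing a concept name, `encode_image_goal` takes a label for what the image depicts instead of pixels, and `select_action` uses fixed arbitrary weights where a trained network would be. In a real multimodal encoder, matching image and text embeddings would only be close, not identical.

```python
import hashlib
import math
from typing import List, Sequence

EMBED_DIM = 8  # toy size, chosen only for this sketch
ACTIONS = ["move_forward", "turn_left", "turn_right", "stop"]  # hypothetical action set


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def toy_embed(concept: str) -> List[float]:
    """Hypothetical shared semantic space: one deterministic unit vector per concept."""
    digest = hashlib.sha256(concept.encode("utf-8")).digest()
    # Bytes are integers, so (b - 127.5) is never zero and the norm is positive.
    return l2_normalize([b - 127.5 for b in digest[:EMBED_DIM]])


def encode_image_goal(depicted_object: str) -> List[float]:
    """Stand-in image encoder (assumption: the image is summarized by what it shows)."""
    return toy_embed(depicted_object.strip().lower())


def encode_text_goal(text: str) -> List[float]:
    """Stand-in text encoder that projects language into the same space."""
    return toy_embed(text.strip().lower())


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two unit vectors (a plain dot product)."""
    return sum(x * y for x, y in zip(a, b))


def select_action(observation: Sequence[float], goal_embedding: Sequence[float]) -> str:
    """Toy policy: consumes an observation and a goal vector, nothing else."""
    features = list(observation) + list(goal_embedding)
    scores = []
    for i in range(len(ACTIONS)):
        # Fixed, arbitrary weights stand in for learned parameters.
        scores.append(sum(math.sin((i + 1) * (j + 1)) * f for j, f in enumerate(features)))
    return ACTIONS[scores.index(max(scores))]


observation = [0.2, -0.1, 0.7]            # made-up observation features
train_goal = encode_image_goal("sink")    # training time: goal comes from an image
test_goal = encode_text_goal("Sink")      # test time: goal comes from language
same_action = select_action(observation, train_goal) == select_action(observation, test_goal)
```

Because both goal types land in one embedding space, the same `select_action` interface serves ImageNav-style training and language-specified goals at test time without any change to the policy.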

Arjun Majumdar, Gunjan Aggarwal, Bhavika Devnani, Judy Hoffman, Dhruv Batra • 2022

Related benchmarks

Task	Dataset	Metric	Result	
Object Goal Navigation	MP3D	SR	15.3	129
Object Navigation	HM3D	Success Rate (SR)	25.5	110
Object Goal Navigation	HM3D	Success Rate	25.5	80
ObjectGoal Navigation	MP3D (val)	Success Rate	15.3	68
Object Goal Navigation	HM3D v1 (val)	Success Rate (SR)	25.5	65
Object Goal Navigation	HM3D 0.1	SR	25.5	35
Object Navigation	HM3D v1 (val)	SR	25.5	32
Object Goal Navigation	MP3D 1.0 (val)	SR	15.3	30
Image-Goal Navigation	Gibson (A)	Success Rate	36.9	22
Object Goal Navigation	HM3D (test)	SR	25.5	22

Showing 10 of 26 rows
