Our new X account is live! Follow @wizwand_team for updates
WorkDL logo mark

TopV-Nav: Unlocking the Top-View Spatial Reasoning Potential of MLLM for Zero-shot Object Navigation

About

The Zero-Shot Object Navigation (ZSON) task requires embodied agents to find a previously unseen object by navigating in unfamiliar environments. Such a goal-oriented exploration heavily relies on the ability to perceive, understand, and reason based on the spatial information of the environment. However, current LLM-based approaches convert visual observations to language descriptions and reason in the linguistic space, leading to the loss of spatial information. In this paper, we introduce TopV-Nav, an MLLM-based method that directly reasons on the top-view map with sufficient spatial information. To fully unlock the MLLM's spatial reasoning potential in top-view perspective, we propose the Adaptive Visual Prompt Generation (AVPG) method to adaptively construct semantically-rich top-view map. It enables the agent to directly utilize spatial information contained in the top-view map to conduct thorough reasoning. Besides, we design a Dynamic Map Scaling (DMS) mechanism to dynamically zoom top-view map at preferred scales, enhancing local fine-grained reasoning. Additionally, we devise a Potential Target Driven (PTD) mechanism to predict and to utilize target locations, facilitating global and human-like exploration. Experiments on MP3D and HM3D datasets demonstrate the superiority of our TopV-Nav.

Linqing Zhong, Chen Gao, Zihan Ding, Yue Liao, Huimin Ma, Shifeng Zhang, Xu Zhou, Si Liu• 2024

Related benchmarks

TaskDatasetResultRank
ObjectGoal NavigationMP3D (val)
Success Rate35.2
68
Object NavigationHM3D v1 (val)
SR52
32
Object Goal NavigationHM3D 0.1
SR53
18
Object Goal NavigationMP3D
SR35.2
13
Showing 4 of 4 rows

Other info

Follow for update