
MerNav: A Highly Generalizable Memory-Execute-Review Framework for Zero-Shot Object Goal Navigation

About

Visual Language Navigation (VLN) is a fundamental capability for embodied intelligence and a critical open challenge. However, existing methods remain unsatisfactory in both success rate (SR) and generalization: Supervised Fine-Tuning (SFT) approaches typically achieve higher SR, while Training-Free (TF) approaches often generalize better, and it is difficult to obtain both at once. To address this, we propose a Memory-Execute-Review framework consisting of three parts: a hierarchical memory module that provides information support, an execute module that handles routine decision-making and actions, and a review module that handles abnormal situations and corrects behavior. We validate the framework on the Object Goal Navigation task. Across 4 datasets, our average SR shows absolute improvements of 7% and 5% compared to all baseline methods under the TF and Zero-Shot (ZS) settings, respectively. On the most commonly used HM3D_v0.1 and the more challenging open-vocabulary dataset HM3D_OVON, SR improves by 8% and 6%, respectively, under the ZS setting. Furthermore, on the MP3D and HM3D_OVON datasets, our method not only outperforms all TF methods but also surpasses all SFT methods, leading in both SR (by 5% and 2%, respectively) and generalization.
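The abstract describes the framework only at the level of its three modules. Below is a minimal Python sketch of how a memory/execute/review control loop of this kind could be wired together, and of how SR is computed. It rests on loudly stated assumptions, none of which come from the MerNav paper: a toy 2D grid instead of a simulator, a goal position known in advance (a real object-goal episode must first find the object), a greedy executor, and a review module that triggers on "no progress" or "about to revisit a recent cell" and replans with breadth-first search over the map held in long-term memory. All names (`HierarchicalMemory`, `execute`, `is_abnormal`, `review`, `run_episode`, `success_rate`, `GRID`) are hypothetical.

```python
from collections import deque
from dataclasses import dataclass, field

# Hypothetical toy occupancy grid: '#' = obstacle, '.' = free space.
GRID = ["..#..", "..#..", "....."]
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right (fixed order)


def free_neighbors(pos, grid):
    """Yield in-bounds, obstacle-free 4-neighbours of pos, in MOVES order."""
    r, c = pos
    for dr, dc in MOVES:
        nr, nc = r + dr, c + dc
        if 0 <= nr < len(grid) and 0 <= nc < len(grid[0]) and grid[nr][nc] != "#":
            yield (nr, nc)


def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


@dataclass
class HierarchicalMemory:
    """Hypothetical two-level memory: a long-term map plus a window of recent positions."""
    grid: list
    short_term: deque = field(default_factory=lambda: deque(maxlen=4))

    def update(self, pos):
        self.short_term.append(pos)


def execute(pos, goal, memory):
    """Routine decision: greedy step that strictly reduces distance to the goal, else None."""
    candidates = list(free_neighbors(pos, memory.grid))
    if not candidates:
        return None
    best = min(candidates, key=lambda p: manhattan(p, goal))
    return best if manhattan(best, goal) < manhattan(pos, goal) else None


def is_abnormal(proposed, memory):
    """Review trigger: no progress, or about to revisit a recently visited cell."""
    return proposed is None or proposed in memory.short_term


def review(pos, goal, memory):
    """Correction: first step of a BFS shortest path on the long-term map."""
    if pos == goal:
        return None
    parent = {pos: None}
    queue = deque([pos])
    while queue:
        cur = queue.popleft()
        if cur == goal:
            # Walk back along parents to the node adjacent to the start.
            while parent[cur] != pos:
                cur = parent[cur]
            return cur
        for nxt in free_neighbors(cur, memory.grid):
            if nxt not in parent:
                parent[nxt] = cur
                queue.append(nxt)
    return None  # goal unreachable on the known map


def run_episode(start, goal, grid, use_review=True, max_steps=20):
    """Return (success, steps_taken, review_interventions)."""
    memory = HierarchicalMemory(grid)
    pos, interventions = start, 0
    for step in range(max_steps):
        memory.update(pos)
        if pos == goal:
            return True, step, interventions
        action = execute(pos, goal, memory)
        if use_review and is_abnormal(action, memory):
            action = review(pos, goal, memory)
            interventions += 1
        if action is not None:
            pos = action  # with no valid action, the agent simply stays put
    return pos == goal, max_steps, interventions


def success_rate(outcomes):
    """SR in percent: share of episodes that reached the goal."""
    return 100.0 * sum(outcomes) / len(outcomes)


if __name__ == "__main__":
    with_review = run_episode((0, 0), (0, 4), GRID)
    without_review = run_episode((0, 0), (0, 4), GRID, use_review=False)
    print(with_review)     # (True, 8, 3)
    print(without_review)  # (False, 20, 0)
    print(success_rate([with_review[0], without_review[0]]))  # 50.0
```

In this toy run the greedy executor alone stalls in front of the wall, while the review module fires three times (once for the stall, twice for oscillation) and routes the agent around it. The point is only the division of labour between routine execution and exception handling; it says nothing about how MerNav's actual modules are implemented.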

Dekang Qi, Shuang Zeng, Xinyuan Chang, Feng Xiong, Shichao Xie, Xiaolong Wu, Mu Xu • 2026

Related benchmarks

Task | Dataset | Metric | Result | Rank
Object Goal Navigation | HM3D 0.1 | SR (%) | 68 | 18
Object Goal Navigation | MP3D | SR (%) | 50.8 | 13
Object Goal Navigation | HM3D OVON | SR (%) | 45.7 | 11
Open-Vocabulary Object Goal Navigation | HM3D OVON (test) | SR (%) | 45.7 | 7
Object Goal Navigation | HM3D 0.2 | SR (%) | 74.8 | 5
Open-Vocabulary Object Goal Navigation | MP3D (test) | SR (%) | 50.8 | 4
