
KERM: Knowledge Enhanced Reasoning for Vision-and-Language Navigation

About

Vision-and-language navigation (VLN) is the task of enabling an embodied agent to navigate to a remote location in real scenes by following a natural language instruction. Most previous approaches represent navigable candidates with either entire-view features or object-centric features. However, these representations are not efficient enough for an agent to perform the actions needed to arrive at the target location. Because knowledge provides crucial information that is complementary to visible content, in this paper we propose a Knowledge Enhanced Reasoning Model (KERM) that leverages knowledge to improve the agent's navigation ability. Specifically, we first retrieve facts (i.e., knowledge described by language descriptions) for the navigation views, based on their local regions, from a constructed knowledge base. The retrieved facts range from properties of a single object (e.g., color, shape) to relationships between objects (e.g., action, spatial position), providing crucial information for VLN. KERM contains purification, fact-aware interaction, and instruction-guided aggregation modules to integrate visual, history, instruction, and fact features. The proposed KERM can automatically select and gather crucial and relevant cues, obtaining more accurate action prediction. Experimental results on the REVERIE, R2R, and SOON datasets demonstrate the effectiveness of the proposed method.
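The abstract's instruction-guided aggregation step can be illustrated with a minimal sketch: weight the retrieved fact vectors by their similarity to an instruction-derived query, then fuse them into a single feature by a weighted sum. This is a generic attention-style aggregation written for illustration, not the paper's actual module; the function names and the plain dot-product scoring are assumptions.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def aggregate_facts(query, facts):
    """Hypothetical instruction-guided aggregation sketch.

    query: instruction feature vector (list of floats)
    facts: retrieved fact vectors, one per fact (list of lists)

    Scores each fact by scaled dot-product similarity to the query,
    converts scores to attention weights, and returns the weighted
    sum of fact vectors as one fused fact feature.
    """
    d = len(query)
    scores = [sum(q * f for q, f in zip(query, fact)) / math.sqrt(d)
              for fact in facts]
    weights = softmax(scores)
    return [sum(w * fact[i] for w, fact in zip(weights, facts))
            for i in range(d)]

# Usage: a query aligned with the first fact pulls the fused feature
# toward that fact.
fused = aggregate_facts([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

In the fused vector, the component from the query-aligned fact dominates, which is the behavior the abstract describes as "automatically select and gather crucial and relevant cues".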

Xiangyang Li, Zihan Wang, Jiahao Yang, Yaowei Wang, Shuqiang Jiang • 2023

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Vision-and-Language Navigation | R2R (val unseen) | Success Rate (SR) | 71.95 | 260 |
| Vision-and-Language Navigation | REVERIE (val unseen) | SPL | 35.38 | 129 |
| Vision-and-Language Navigation | R2R (val seen) | Success Rate (SR) | 79.73 | 120 |
| Vision-and-Language Navigation | R2R (test unseen) | Success Rate (SR) | 69.73 | 116 |
| Vision-and-Language Navigation | REVERIE (test unseen) | Success Rate (SR) | 52.43 | 40 |
| Remote Grounding | REVERIE (test unseen) | RGS | 32.69 | 33 |
| Vision-and-Language Navigation | REVERIE (val seen) | Success Rate (SR) | 76.88 | 28 |
| Remote Grounding | REVERIE (val unseen) | RGS | 34.51 | 22 |
| Vision-and-Language Navigation | SOON (val unseen) | SPL | 23.16 | 16 |
| Remote Grounding | REVERIE (val seen) | RGS | 61 | 15 |
