
Agent Journey Beyond RGB: Hierarchical Semantic-Spatial Representation Enrichment for Vision-and-Language Navigation

About

Navigating unseen environments from natural language instructions remains challenging for egocentric agents in Vision-and-Language Navigation (VLN). Humans naturally ground concrete semantic knowledge within spatial layouts during indoor navigation. Although prior work has introduced diverse environment representations to improve reasoning, auxiliary modalities are often naively concatenated with RGB features, which underutilizes each modality's distinct contribution. We propose a hierarchical Semantic Understanding and Spatial Awareness (SUSA) architecture to enable agents to perceive and ground environments at multiple scales. Specifically, the Textual Semantic Understanding (TSU) module supports local action prediction by generating view-level descriptions, capturing fine-grained semantics and narrowing the modality gap between instructions and environments. Complementarily, the Depth Enhanced Spatial Perception (DSP) module incrementally builds a trajectory-level depth exploration map, providing a coarse-grained representation of global spatial layout. Extensive experiments show that the hierarchical representation enrichment of SUSA significantly improves navigation performance over the baseline on discrete VLN benchmarks (REVERIE, R2R, and SOON) and generalizes better to the continuous R2R-CE benchmark.

Xuesong Zhang, Yunbo Xu, Jia Li, Ruonan Liu, Zhenzhen Hu • 2024

Related benchmarks

Task                            Dataset                Metric             Result  Rank
Vision-Language Navigation      R2R-CE (val-unseen)    Success Rate (SR)  52.7    266
Vision-and-Language Navigation  R2R (val unseen)       Success Rate (SR)  73      260
Vision-and-Language Navigation  REVERIE (val unseen)   SPL                39.21   129
Vision-Language Navigation      R2R Unseen (test)      SR                 72.5    116
Vision-and-Language Navigation  R2R-CE (test-unseen)   SR                 50.9    50
Vision-and-Language Navigation  REVERIE Unseen (test)  Success Rate (SR)  54.3    40
Vision-and-Language Navigation  SOON (val unseen)      SPL                30.8    16
Vision-and-Language Navigation  SOON (test-unseen)     SPL                25.4    5
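
For readers unfamiliar with the two metrics reported above, the following is a minimal sketch of how Success Rate (SR) and Success weighted by Path Length (SPL) are commonly computed in the VLN literature. It is not taken from the SUSA codebase: the function names and the toy episode values are illustrative assumptions, and the sketch assumes every shortest-path length is positive. Benchmark tables usually report these values multiplied by 100.

```python
from typing import Sequence


def success_rate(successes: Sequence[bool]) -> float:
    """Fraction of episodes in which the agent stopped at the goal."""
    if not successes:
        raise ValueError("need at least one episode")
    return sum(successes) / len(successes)


def spl(
    successes: Sequence[bool],
    shortest_lengths: Sequence[float],
    path_lengths: Sequence[float],
) -> float:
    """Success weighted by Path Length.

    Each successful episode contributes l / max(p, l), where l is the
    shortest-path length to the goal and p is the length of the path the
    agent actually took. Failed episodes contribute 0.
    """
    if not successes:
        raise ValueError("need at least one episode")
    total = 0.0
    for ok, shortest, taken in zip(successes, shortest_lengths, path_lengths):
        if ok:
            total += shortest / max(taken, shortest)
    return total / len(successes)


# Hypothetical toy episodes (illustrative values, not results from the paper).
toy_successes = [True, False, True]
toy_shortest = [10.0, 8.0, 5.0]
toy_taken = [12.5, 9.0, 5.0]

toy_sr = success_rate(toy_successes)
toy_spl = spl(toy_successes, toy_shortest, toy_taken)
```

SPL penalizes detours: an agent that reaches the goal along a longer-than-optimal route earns less than full credit for that episode, so SPL is never higher than SR.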
