
Vision-Language Navigation with Energy-Based Policy

About

Vision-language navigation (VLN) requires an agent to execute actions following human instructions. Existing VLN models are optimized through expert demonstrations by supervised behavioural cloning or by incorporating manual reward engineering. While straightforward, these efforts overlook the accumulation of errors in the Markov decision process and struggle to match the distribution of the expert policy. Going beyond this, we propose an Energy-based Navigation Policy (ENP) to model the joint state-action distribution using an energy-based model. At each step, low energy values correspond to the state-action pairs that the expert is most likely to perform, and vice versa. Theoretically, the optimization objective is equivalent to minimizing the forward divergence between the occupancy measure of the expert and ours. Consequently, ENP learns to globally align with the expert policy by maximizing the likelihood of the actions and modeling the dynamics of the navigation states in a collaborative manner. With a variety of VLN architectures, ENP achieves promising performance on R2R, REVERIE, RxR, and R2R-CE, unleashing the power of existing VLN models.

Rui Liu, Wenguan Wang, Yi Yang • 2024
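
The abstract's statement that low energy marks the state-action pairs an expert is most likely to take can be illustrated with a generic energy-based (Boltzmann) rule, where the probability of an action is proportional to exp(-energy). The sketch below is only an illustration under that assumption and is not the authors' ENP implementation; the function names `action_distribution` and `negative_log_likelihood` and the energy values are hypothetical.

```python
import math


def action_distribution(energies):
    """Map per-action energies to probabilities: p(a | s) proportional to exp(-E(s, a))."""
    # Shift by the minimum energy for numerical stability; the result is unchanged.
    e_min = min(energies)
    weights = [math.exp(-(e - e_min)) for e in energies]
    total = sum(weights)
    return [w / total for w in weights]


def negative_log_likelihood(energies, expert_action):
    """Loss that is small when the expert's action has low energy (high probability)."""
    probs = action_distribution(energies)
    return -math.log(probs[expert_action])


# Hypothetical energies for three candidate actions at one navigation step.
energies = [2.0, 0.5, 3.0]
probs = action_distribution(energies)

# The lowest-energy action is the most probable one under this rule.
best = min(range(len(energies)), key=lambda i: energies[i])
```

Maximizing the likelihood of expert actions under such a rule pushes the energy of expert state-action pairs down relative to the alternatives. How ENP additionally models the dynamics of the navigation states is described in the paper and is not covered by this sketch.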

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Vision-Language Navigation | R2R-CE (val-unseen) | Success Rate (SR) | 58 | 266 |
| Vision-Language Navigation | RxR-CE (val-unseen) | SR | 55.3 | 172 |
| Vision-and-Language Navigation | REVERIE (val unseen) | SPL | 33.8 | 129 |
| Vision-and-Language Navigation | R2R-CE (test-unseen) | SR | 56 | 50 |
| Vision-and-Language Navigation | R2R-CE (val-seen) | SR | 68 | 49 |
| Vision-and-Language Navigation | REVERIE Unseen (test) | Success Rate (SR) | 53.19 | 40 |
| Vision-Language Navigation | R2R unseen v1.0 (val) | SR | 74 | 24 |
| Vision-Language Navigation | R2R 1 (test unseen) | Success Rate | 0.71 | 18 |
| Vision-and-Language Navigation | RxR-CE seen (val) | NE | 5.1 | 13 |
