
MSGNav: Unleashing the Power of Multi-modal 3D Scene Graph for Zero-Shot Embodied Navigation

About

Embodied navigation is a fundamental capability for robotic agents. Real-world deployment requires open-vocabulary generalization and low training overhead, motivating zero-shot methods rather than task-specific RL training. However, existing zero-shot methods that build explicit 3D scene graphs often compress rich visual observations into text-only relations, leading to high construction cost, irreversible loss of visual evidence, and constrained vocabularies. To address these limitations, we introduce the Multi-modal 3D Scene Graph (M3DSG), which preserves visual cues by replacing textual relation

Xun Huang, Shijia Zhao, Yunxiang Wang, Xin Lu, Wanfa Zhang, Rongsheng Qu, Weixin Li, Yunhong Wang, Chenglu Wen • 2025

Related benchmarks

Task                              Dataset                    Result                    Rank
Object Goal Navigation            HM3D-OVON Seen (val)       SR: 48.3                  44
Multi-Modal Lifelong Navigation   GOAT-Bench unseen (val)    SR: 52                    22
Object Navigation                 HM3D ObjNav                Success Rate (SR): 63     13
