GeoNav: Empowering MLLMs with dual-scale geospatial reasoning for language-goal aerial navigation
About
Language-goal aerial navigation requires UAVs to localize targets in complex outdoor environments, such as urban blocks, based on textual instructions. Methods designed for indoor scenes are often hard to scale to urban scenes because of ambiguous objects, a limited visual field, and the demands of spatial reasoning. In this work, we propose GeoNav, a multi-modal agent for long-range aerial navigation with geospatial awareness. GeoNav operates in three phases (landmark navigation, target search, and precise localization), mimicking human coarse-to-fine spatial reasoning patterns. To support such reasoning, it dynamically builds dual-scale spatial representations. The first is a global but schematic cognitive map, which fuses prior geographic knowledge and embodied visual cues into a top-down, explicitly annotated form. It enables fast navigation to the landmark region via intuitive map-based reasoning. The second is a local but fine-grained scene graph representing hierarchical spatial relationships between landmarks and objects, used for accurate target localization. On top of this structured memory, GeoNav employs a spatial chain-of-thought mechanism that equips MLLMs with efficient and interpretable decision-making across stages. On the CityNav benchmark, GeoNav surpasses the current SOTA by up to 18.4% in success rate and significantly reduces navigation error. The ablation studies highlight the importance of each module, positioning structured spatial perception as the key to advanced UAV navigation. Published in Pattern Recognition, 2026.
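The three-phase, coarse-to-fine control flow described in the abstract can be pictured as a small state machine. The sketch below is illustrative only and is not the authors' implementation: the `Observation` fields, the `next_phase` function, and the radius thresholds are hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Phase(Enum):
    LANDMARK_NAVIGATION = 1   # coarse: fly toward the landmark region
    TARGET_SEARCH = 2         # medium: look for the target near the landmark
    PRECISE_LOCALIZATION = 3  # fine: close in on the target
    DONE = 4


@dataclass
class Observation:
    # All fields are hypothetical inputs for this sketch.
    dist_to_landmark: float   # distance to the landmark region
    target_visible: bool      # whether the target has been detected
    dist_to_target: float     # distance to the target, if detected


def next_phase(phase, obs, landmark_radius=50.0, stop_radius=5.0):
    """Advance the phase when its exit condition holds (thresholds are assumed)."""
    if phase is Phase.LANDMARK_NAVIGATION and obs.dist_to_landmark <= landmark_radius:
        return Phase.TARGET_SEARCH
    if phase is Phase.TARGET_SEARCH and obs.target_visible:
        return Phase.PRECISE_LOCALIZATION
    if phase is Phase.PRECISE_LOCALIZATION and obs.dist_to_target <= stop_radius:
        return Phase.DONE
    return phase  # otherwise stay in the current phase
```

In a design like this, each phase would consult a different representation: the global cognitive map during landmark navigation and the local scene graph during search and localization.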
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Navigation | CityNav (test unseen) | Navigation Error (NE) | 73.5 | 14 |
| Navigation | CityNav unseen (val) | Navigation Error (NE) | 64.1 | 14 |
| Navigation | CityNav seen (val) | Navigation Error (NE) | 58.6 | 14 |
| Vision-Language Navigation | CityNav Easy | Navigation Error (NE) | 59.86 | 6 |
| Vision-Language Navigation | CityNav Medium | Navigation Error (Path Length) | 53.8 | 6 |
| Vision-Language Navigation | CityNav Hard | Navigation Error (NE) | 68.9 | 6 |
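For readers unfamiliar with the metrics, navigation error is conventionally the distance between the agent's final position and the target, and success rate is the fraction of episodes whose error falls within a threshold. The following is a minimal sketch under those assumptions; the function names are hypothetical, and the default threshold is a placeholder rather than a value confirmed by this page.

```python
import math


def navigation_error(final_xy, target_xy):
    """Euclidean distance between the final position and the target."""
    return math.dist(final_xy, target_xy)


def success_rate(errors, threshold=20.0):
    """Fraction of episodes with error within `threshold` (assumed value)."""
    if not errors:
        return 0.0
    return sum(e <= threshold for e in errors) / len(errors)


# Example with made-up episode endpoints (not benchmark data).
episodes = [((0.0, 0.0), (3.0, 4.0)), ((0.0, 0.0), (30.0, 40.0))]
errors = [navigation_error(final, target) for final, target in episodes]
rate = success_rate(errors)
```

Lower navigation error is better, while higher success rate is better, which is why the abstract reports a gain in one and a reduction in the other.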