LASAR: Towards Spatio-temporal Reasoning with Latent Cognitive Map
About
A fundamental challenge in embodied AI is verifying whether agents build internal models of spatial structure or merely learn to mimic task-specific expert trajectories. This matters because foundational approaches rooted in action-centric tasks (e.g., VLN) and reasoning-centric tasks (e.g., EQA) often share a common limitation: they lack a learning signal that forces them to encode fine-grained spatial relationships (such as topology or distance) over long-range, fragmented experiences. To address this, we first propose LASAR, an architecture featuring a dual-memory system designed to maintain both episodic experiences and a semantic cognitive map. We then introduce Spatio-temporal Contextual Representation Learning (ST-CRL), a contrastive objective designed to train this architecture. ST-CRL builds sample pairs from the spatio-temporal cues in cognitive queries, which are generated from annotated spatio-temporal context in simulation, and thereby forms the internal cognitive map from the agent's experiences. Experiments demonstrate that our method achieves gains of 2% to 3.5% in zero-shot generalization on both the standard VLN-CE and VSI-Bench benchmarks. We also demonstrate that our proposed cognitive map has high self-consistency.
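The abstract does not spell out the ST-CRL loss, so the following is only a minimal sketch of a generic InfoNCE-style contrastive objective, the common form such losses take, and not the authors' formulation. It assumes each cognitive query embedding is paired with exactly one matching map embedding and treats the other map embeddings in the batch as negatives. The function name `info_nce_loss`, the `temperature` value, and the toy embeddings are all hypothetical.

```python
import numpy as np


def info_nce_loss(queries, keys, temperature=0.1):
    """Generic InfoNCE loss (illustrative, not the paper's ST-CRL).

    queries[i] and keys[i] are assumed to form a positive pair; every
    other row of `keys` acts as a negative for queries[i].
    """
    # L2-normalise so the dot product is cosine similarity.
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    k = keys / np.linalg.norm(keys, axis=1, keepdims=True)

    # Pairwise similarities, scaled by the temperature.
    logits = (q @ k.T) / temperature

    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # The positive for row i sits on the diagonal.
    return float(-np.mean(np.diag(log_probs)))


# Toy data: four orthogonal "query" embeddings.
queries = np.eye(4)
aligned_keys = queries.copy()               # each query matches its own key
shuffled_keys = np.roll(queries, 1, axis=0)  # each query paired with a wrong key

loss_aligned = info_nce_loss(queries, aligned_keys)
loss_shuffled = info_nce_loss(queries, shuffled_keys)
```

In this toy setup the aligned pairing yields a much lower loss than the shuffled one, which is the property a contrastive objective relies on: embeddings of matching query and map pairs are pulled together while mismatched pairs are pushed apart.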
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Vision-and-Language Navigation | R2R (val unseen) | Success Rate (SR) | 57 | 448 |
| Vision-Language Navigation | RxR (val-unseen) | Success Rate (SR) | 52.1 | 62 |
| Video Visual Question Answering | VSI-Bench | ACC (MCA) | 43.5 | 28 |
| Vision-Language Navigation with Reasoning | MindCraft (test) | QA Accuracy | 65.3 | 11 |