
LASAR: Towards Spatio-temporal Reasoning with Latent Cognitive Map

About

A fundamental challenge in embodied AI is verifying whether agents build internal models of spatial structure or merely learn to mimic task-specific expert trajectories. This is critical because foundational approaches rooted in action-centric tasks (e.g., VLN) and reasoning-centric tasks (e.g., EQA) often share a common limitation: they lack a learning signal that forces them to encode fine-grained spatial relationships (such as topology or distance) over long-range, fragmented experiences. To address this, we first propose LASAR, an architecture featuring a dual-memory system designed to maintain both episodic experiences and a semantic cognitive map. We then introduce Spatio-temporal Contextual Representation Learning (ST-CRL), a contrastive objective designed to train this architecture. ST-CRL builds sample pairs from spatio-temporal cues in cognitive queries, which are generated from annotated spatio-temporal context in simulation, thereby forming the internal cognitive map from the agent's experiences. Experiments demonstrate that our method achieves gains of 2%-3.5% in zero-shot generalization on both the standard VLN-CE and VSI-Bench benchmarks. We also demonstrate that our proposed cognitive map has high self-consistency.
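
As a rough illustration of what a contrastive objective over sample pairs looks like, the sketch below implements a generic InfoNCE-style loss for a single anchor. It is not the ST-CRL objective itself, whose exact form is not given in the abstract. The function names (`cosine`, `info_nce`), the temperature value, and the toy 2-D embeddings are all assumptions made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE loss for one anchor (hypothetical helper, not ST-CRL).

    Returns the negative log-softmax score of the positive among
    the positive and all negatives.
    """
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    m = max(logits)  # subtract the max before exponentiating for stability
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denom - logits[0]


# Toy embeddings (assumed): a query, a spatially "nearby" observation,
# and two "far" observations.
query = [1.0, 0.0]
nearby = [0.9, 0.1]
far = [[0.0, 1.0], [-1.0, 0.0]]

loss_good = info_nce(query, nearby, far)              # positive is the nearby view
loss_bad = info_nce(query, far[0], [nearby, far[1]])  # positive is a far view
```

In this toy setup the loss is small when the designated positive is the embedding closest to the query and large when it is not, which is the pressure a contrastive objective uses to pull related pairs together in the representation space.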

Jinzhou Tang, Sidi Liu, Waikit Xiu, Weixing Chen, Keze Wang • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Vision-and-Language Navigation | R2R (val unseen) | Success Rate (SR): 57 | 448 |
| Vision-Language Navigation | RxR (val-unseen) | Success Rate (SR): 52.1 | 62 |
| Video Visual Question Answering | VSI-Bench | ACC (MCA): 43.5 | 28 |
| Vision-Language Navigation with Reasoning | MindCraft (test) | QA Accuracy: 65.3 | 11 |
