LASAR: Towards Spatio-temporal Reasoning with Latent Cognitive Map

About

A fundamental challenge in embodied AI is verifying whether agents build internal models of spatial structure or merely learn to mimic task-specific expert trajectories. This is critical because foundational approaches rooted in action-centric tasks (e.g., VLN) and reasoning-centric tasks (e.g., EQA) often share a common limitation: they lack a learning signal that forces them to encode fine-grained spatial relationships (like topology or distance) over long-range, fragmented experiences. To address this, we first propose LASAR, an architecture featuring a dual-memory system designed to maintain both episodic experiences and a semantic cognitive map. We then introduce Spatio-temporal Contextual Representation Learning (ST-CRL), a contrastive objective designed to train this architecture. ST-CRL builds sample pairs from spatio-temporal cues in cognitive queries, which are generated from annotated spatio-temporal context in simulation, thereby forming the internal cognitive map from the agent's experiences. Experiments demonstrate that our method achieves gains of 2%-3.5% in zero-shot generalization on the standard VLN-CE and VSI-Bench benchmarks. We also demonstrate that our proposed cognitive map has high self-consistency.
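
The abstract does not give the exact form of the ST-CRL loss, so the following is only a minimal sketch of what a contrastive objective over sample pairs can look like. It uses the generic InfoNCE loss as a stand-in, not the authors' formulation. The function names, the temperature value, and the toy two-dimensional embeddings are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(query, positive, negatives, temperature=0.1):
    """Generic InfoNCE loss for one query (a stand-in, not ST-CRL itself).

    Returns -log of the softmax probability assigned to the positive
    among the positive and all negatives.
    """
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, n) / temperature for n in negatives]
    m = max(logits)  # subtract the max for numerical stability
    log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denom - logits[0]


# Hypothetical embeddings: `near` stands for an observation that is
# spatio-temporally close to the query, `far` for distant ones.
query = [1.0, 0.0]
near = [0.9, 0.1]
far = [[-1.0, 0.0], [0.0, -1.0]]

loss_good = info_nce(query, near, far)              # positive is the close sample
loss_bad = info_nce(query, far[0], [near, far[1]])  # positive is a distant sample
```

In this toy setup the loss is small when the embedding space already places the spatio-temporally close sample next to the query, and large when a distant sample is treated as the positive. That gap is the kind of learning signal a contrastive objective provides.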

Jinzhou Tang, Sidi Liu, Waikit Xiu, Weixing Chen, Keze Wang • 2026

Related benchmarks

Task	Dataset	Result
Vision-and-Language Navigation	R2R (val unseen)	Success Rate (SR): 57	476
Vision-Language Navigation	RxR (val-unseen)	Success Rate (SR): 52.1	62
Video Visual Question Answering	VSI-Bench	ACC (MCA): 43.5	28
Vision-Language Navigation with Reasoning	MindCraft (test)	QA Accuracy: 65.3	11

