SpatialMem: Metric-Aligned Long-Horizon Video Memory for Language Grounding and QA

About

We present SpatialMem, a memory-centric system for long-horizon, language-grounded retrieval and QA from egocentric video, where metric 3D serves as an interpretable indexing scaffold rather than an explicit mapping objective. Starting from casually captured egocentric RGB video, SpatialMem builds a metric-aligned spatial scaffold for indoor scenes, detects structural 3D anchors (walls, doors, windows) as first-layer support, and populates a hierarchical memory with open-vocabulary object nodes that link evidence patches, visual embeddings, and two-layer textual descriptions to 3D coordinates for compact storage and fast retrieval. This design enables interpretable, spatially grounded queries over relations (e.g., distance, direction, visibility) and supports downstream tasks such as language-guided retrieval/QA and offline navigation-style guidance over a prebuilt memory, without specialized sensors. Experiments on one public Replica scene and two real-world egocentric indoor scenes show that SpatialMem maintains stable layout reasoning, offline guidance, and hierarchical retrieval across these evaluated scenes despite increasing clutter and occlusion. A compact ablation further shows that the two-layer description memory improves path-level grounding, while moderate scale perturbation causes only limited degradation. These results position SpatialMem as an efficient and extensible memory interface for spatially grounded long-horizon video understanding.
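The kind of spatially grounded relation query described above can be illustrated with a minimal sketch. This is not the authors' implementation: the class names `ObjectNode` and `AnchorNode`, the field names, the helper functions, and the convention that x and y span the floor plane are all assumptions made for illustration. Visibility is not modeled here.

```python
import math
from dataclasses import dataclass, field


@dataclass
class ObjectNode:
    """Hypothetical object node: a label, a metric 3D position, two description layers."""
    label: str
    position: tuple[float, float, float]  # metres in the scaffold frame (assumed)
    short_desc: str = ""  # first description layer (assumed to be coarse)
    long_desc: str = ""   # second description layer (assumed to be detailed)


@dataclass
class AnchorNode:
    """Hypothetical structural anchor (wall, door, window) that groups object nodes."""
    kind: str
    position: tuple[float, float, float]
    objects: list[ObjectNode] = field(default_factory=list)


def distance(a, b) -> float:
    """Euclidean distance in metres between two nodes."""
    return math.dist(a.position, b.position)


def bearing_deg(a, b) -> float:
    """Horizontal direction from a to b, in degrees counter-clockwise from the +x axis."""
    dx = b.position[0] - a.position[0]
    dy = b.position[1] - a.position[1]
    return math.degrees(math.atan2(dy, dx)) % 360.0


def nearest(anchor: AnchorNode, query_label: str):
    """Return the object under this anchor with the given label that is closest to it."""
    matches = [o for o in anchor.objects if o.label == query_label]
    return min(matches, key=lambda o: distance(anchor, o), default=None)


door = AnchorNode("door", (0.0, 0.0, 0.0))
door.objects.append(ObjectNode("mug", (3.0, 4.0, 0.0), short_desc="red mug"))
door.objects.append(ObjectNode("mug", (1.0, 0.0, 0.0), short_desc="white mug"))
closest_mug = nearest(door, "mug")
```

Because every node carries metric coordinates, distance and direction reduce to plain geometry on stored positions, which is what makes such answers easy to inspect.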

Xinyi Zheng, Yunze Liu, Chi-Hao Wu, Fan Zhang, Hao Zheng, Wenqi Zhou, Walterio W. Mayol-Cuevas, Junxiao Shen • 2026

Related benchmarks

Task	Dataset	Metric	Result	
Description Quality	Scene Simple room 1	Color Accuracy	82	8
Description Quality	Scene 3 Laboratory storage 1.0 (overall)	Color Accuracy	78	8
Description Quality	Scene 2 Suite main room	Color Accuracy	79	8
Navigation	Scene 2 Suite main room	SRnav	0.8	8
Object Retrieval	Scene Simple room 1 (overall)	SRobj	83	8
Object Retrieval	Scene 3 Laboratory storage 1.0	SRobj	72	8
Object Retrieval	Scene 2 Suite main room	SRobj	78	8
Instruction-based Navigation	Scene 3 Laboratory storage 1.0 (overall)	SRnav	54	8
Navigation	Scene Simple room 1 (overall)	Success Rate (SR)	77	8
Relative Position Reasoning	Scene 3 Laboratory storage 1.0 (overall)	Accrel	74	8

Showing 10 of 12 rows
