SpatialMem: Metric-Aligned Long-Horizon Video Memory for Language Grounding and QA

About

We present SpatialMem, a memory-centric system for long-horizon, language-grounded retrieval and QA from egocentric video, where metric 3D serves as an interpretable indexing scaffold rather than an explicit mapping objective. Starting from casually captured egocentric RGB video, SpatialMem builds a metric-aligned spatial scaffold for indoor scenes, detects structural 3D anchors (walls, doors, windows) as first-layer support, and populates a hierarchical memory with open-vocabulary object nodes that link evidence patches, visual embeddings, and two-layer textual descriptions to 3D coordinates for compact storage and fast retrieval. This design enables interpretable, spatially grounded queries over relations (e.g., distance, direction, visibility) and supports downstream tasks such as language-guided retrieval/QA and offline navigation-style guidance over a prebuilt memory, without specialized sensors. Experiments on one public Replica scene and two real-world egocentric indoor scenes show that SpatialMem maintains stable layout reasoning, offline guidance, and hierarchical retrieval across these evaluated scenes despite increasing clutter and occlusion. A compact ablation further shows that the two-layer description memory improves path-level grounding, while moderate scale perturbation causes only limited degradation. These results position SpatialMem as an efficient and extensible memory interface for spatially grounded long-horizon video understanding.

Xinyi Zheng, Yunze Liu, Chi-Hao Wu, Fan Zhang, Hao Zheng, Wenqi Zhou, Walterio W. Mayol-Cuevas, Junxiao Shen • 2026
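
As a rough illustration of the kind of memory the abstract describes, the sketch below stores object nodes with 3D coordinates and a two-layer description, then answers simple distance and direction queries over them. It is a minimal sketch and not the authors' implementation: the names `ObjectNode`, `distance`, `bearing_deg`, and `within_radius`, the field layout, and the z-up metric frame are all assumptions made for this example.

```python
from dataclasses import dataclass, field
import math


@dataclass
class ObjectNode:
    """Hypothetical memory node; the field names are illustrative assumptions."""
    label: str
    position: tuple[float, float, float]  # metres in an assumed z-up scene frame
    short_desc: str                       # first description layer (brief)
    long_desc: str                        # second description layer (detailed)
    embedding: list[float] = field(default_factory=list)  # visual embedding, unused here


def distance(a: ObjectNode, b: ObjectNode) -> float:
    """Euclidean distance between two nodes, in metres."""
    return math.dist(a.position, b.position)


def bearing_deg(a: ObjectNode, b: ObjectNode) -> float:
    """Horizontal direction from a to b, in degrees counter-clockwise from +x."""
    dx = b.position[0] - a.position[0]
    dy = b.position[1] - a.position[1]
    return math.degrees(math.atan2(dy, dx)) % 360.0


def within_radius(memory: list[ObjectNode], anchor: ObjectNode, radius: float) -> list[ObjectNode]:
    """Nodes other than the anchor within `radius` metres, nearest first."""
    hits = [n for n in memory if n is not anchor and distance(anchor, n) <= radius]
    return sorted(hits, key=lambda n: distance(anchor, n))


# Toy scene with made-up coordinates.
door = ObjectNode("door", (0.0, 0.0, 0.0), "a door", "a white door on the south wall")
lamp = ObjectNode("lamp", (0.0, 1.0, 0.0), "a lamp", "a floor lamp beside the door")
mug = ObjectNode("mug", (3.0, 4.0, 0.0), "a mug", "a blue mug on the desk")
memory = [door, lamp, mug]

nearby = within_radius(memory, door, 2.0)
```

Because every node carries metric coordinates, relational queries such as "what is within two metres of the door" reduce to plain geometry over the stored positions, which is the sense in which the abstract calls the 3D scaffold an interpretable index.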

Related benchmarks

Task | Dataset | Metric | Result | Rank
Description Quality | Scene Simple room 1 | Color Accuracy | 82 | 8
Description Quality | Scene 3 Laboratory storage 1.0 (overall) | Color Accuracy | 78 | 8
Description Quality | Scene 2 Suite main room | Color Accuracy | 79 | 8
Navigation | Scene 2 Suite main room | SRnav | 0.8 | 8
Object Retrieval | Scene Simple room 1 (overall) | SRobj | 83 | 8
Object Retrieval | Scene 3 Laboratory storage 1.0 | SRobj | 72 | 8
Object Retrieval | Scene 2 Suite main room | SRobj | 78 | 8
Instruction-based Navigation | Scene 3 Laboratory storage 1.0 (overall) | SRnav | 54 | 8
Navigation | Scene Simple room 1 (overall) | Success Rate (SR) | 77 | 8
Relative Position Reasoning | Scene 3 Laboratory storage 1.0 (overall) | Accrel | 74 | 8
Showing 10 of 12 rows
