SpatialMem: Unified 3D Memory with Metric Anchoring and Fast Retrieval
About
We present SpatialMem, a memory-centric system that unifies 3D geometry, semantics, and language into a single, queryable representation. Starting from casually captured egocentric RGB video, SpatialMem reconstructs metrically scaled indoor environments, detects structural 3D anchors (walls, doors, windows) as the first-layer scaffold, and populates a hierarchical memory with open-vocabulary object nodes -- linking evidence patches, visual embeddings, and two-layer textual descriptions to 3D coordinates -- for compact storage and fast retrieval. This design enables interpretable reasoning over spatial relations (e.g., distance, direction, visibility) and supports downstream tasks such as language-guided navigation and object retrieval without specialized sensors. Experiments across three real-life indoor scenes demonstrate that SpatialMem maintains strong anchor-description-level navigation completion and hierarchical retrieval accuracy under increasing clutter and occlusion, offering an efficient and extensible framework for embodied spatial intelligence.
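To make the idea of a two-level memory concrete, the following is a minimal sketch and not the authors' implementation. It assumes structural anchors form the first layer and object nodes hang off them, with positions expressed in metres in a shared scene frame. The names `AnchorNode`, `ObjectNode`, `bearing_deg`, and `nearest_object` are hypothetical, and evidence patches and visual embeddings are omitted for brevity.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Vec3 = Tuple[float, float, float]  # (x, y, z) in metres; assumed scene frame


@dataclass
class ObjectNode:
    """Second-layer node: an open-vocabulary object tied to a 3D position."""
    label: str
    position: Vec3
    short_description: str = ""  # hypothetical coarse text layer
    long_description: str = ""   # hypothetical detailed text layer


@dataclass
class AnchorNode:
    """First-layer node: a structural anchor such as a wall, door, or window."""
    kind: str
    position: Vec3
    objects: List[ObjectNode] = field(default_factory=list)


def distance(a: Vec3, b: Vec3) -> float:
    """Euclidean distance in metres between two 3D points."""
    return math.dist(a, b)


def bearing_deg(src: Vec3, dst: Vec3) -> float:
    """Horizontal direction from src to dst, in degrees counter-clockwise from +x."""
    return math.degrees(math.atan2(dst[1] - src[1], dst[0] - src[0])) % 360.0


def nearest_object(
    anchors: List[AnchorNode], label: str, viewpoint: Vec3
) -> Optional[Tuple[AnchorNode, ObjectNode, float]]:
    """Return (anchor, object, distance) for the closest object with this label."""
    candidates = [
        (distance(viewpoint, obj.position), anchor, obj)
        for anchor in anchors
        for obj in anchor.objects
        if obj.label == label
    ]
    if not candidates:
        return None
    dist, anchor, obj = min(candidates, key=lambda c: c[0])
    return anchor, obj, dist


# Illustrative scene with made-up coordinates.
door = AnchorNode("door", (0.0, 0.0, 0.0), [ObjectNode("mug", (3.0, 4.0, 0.0))])
wall = AnchorNode("wall", (5.0, 0.0, 0.0), [ObjectNode("mug", (10.0, 0.0, 0.0))])
hit = nearest_object([door, wall], "mug", viewpoint=(0.0, 0.0, 0.0))
```

Because every object carries metric coordinates and a parent anchor, a query such as "the mug nearest the door" reduces to a filter plus a distance comparison, which is one way the interpretable spatial relations described above could be computed.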
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Description Quality | Scene Simple room 1 | Color Accuracy | 82 | 8 |
| Description Quality | Scene 3 Laboratory storage 1.0 (overall) | Color Accuracy | 78 | 8 |
| Description Quality | Scene 2 Suite main room | Color Accuracy | 79 | 8 |
| Navigation | Scene 2 Suite main room | SR_nav | 0.8 | 8 |
| Object Retrieval | Scene Simple room 1 (overall) | SR_obj | 83 | 8 |
| Object Retrieval | Scene 3 Laboratory storage 1.0 | SR_obj | 72 | 8 |
| Object Retrieval | Scene 2 Suite main room | SR_obj | 78 | 8 |
| Instruction-based Navigation | Scene 3 Laboratory storage 1.0 (overall) | SR_nav | 54 | 8 |
| Navigation | Scene Simple room 1 (overall) | Success Rate (SR) | 77 | 8 |
| Relative Position Reasoning | Scene 3 Laboratory storage 1.0 (overall) | Acc_rel | 74 | 8 |