GHOST: Geometry-Hierarchical Online Streaming Token Eviction for Efficient 3D Reconstruction
About
Streaming 3D reconstruction from long monocular video sequences requires maintaining a key-value (KV) cache that grows linearly with sequence length, creating a severe memory bottleneck. Existing approaches either truncate the cache to a fixed set of anchor frames, which degrades reconstruction quality, or rely on attention-score heuristics that are agnostic to 3D scene structure and therefore fail to preserve geometrically valuable tokens. To address these problems, we present GHOST (Geometry-Hierarchical Online Streaming Token Eviction), a training-free KV cache management framework that exploits the model's own 3D geometry outputs to evict redundant tokens online. GHOST introduces three mutually reinforcing innovations: a hierarchical dual-level importance scoring scheme, a privilege mechanism that protects special tokens from eviction, and a cosine-similarity-guided layer-wise budget allocation. Experiments on several benchmarks show that GHOST preserves reconstruction quality while cutting the KV cache by nearly half and delivering 1.75x faster inference than state-of-the-art methods. Our code is available at https://github.com/lokiniuniu/GHOST.
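The three components named above can be illustrated with a minimal Python sketch. It is not the authors' implementation, and every name in it (`combine_scores`, `allocate_layer_budgets`, `evict_tokens`) is hypothetical. The abstract does not specify how the two scoring levels are defined or combined, nor how cosine similarity maps to a per-layer budget. The sketch therefore assumes that the two levels are a per-frame score and a per-token score multiplied together, and that layers with a higher similarity value are treated as more redundant and receive a smaller share of the cache budget.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0


def combine_scores(frame_scores, token_scores_per_frame):
    """ASSUMPTION: dual-level score = frame-level score * token-level score.

    Returns one flat list of scores, frame by frame.
    """
    flat = []
    for frame_score, token_scores in zip(frame_scores, token_scores_per_frame):
        flat.extend(frame_score * t for t in token_scores)
    return flat


def allocate_layer_budgets(layer_similarities, total_budget, min_per_layer=1):
    """ASSUMPTION: a layer with higher similarity is more redundant, so it gets less budget.

    Each layer's share is proportional to (1 - similarity), rounded down,
    with a floor of min_per_layer tokens.
    """
    weights = [max(1.0 - s, 0.0) + 1e-6 for s in layer_similarities]
    total_weight = sum(weights)
    return [max(min_per_layer, int(total_budget * w / total_weight)) for w in weights]


def evict_tokens(scores, budget, protected):
    """Return the sorted indices of tokens to keep.

    Protected (privileged) tokens are always kept; the remaining slots go
    to the highest-scoring unprotected tokens.
    """
    protected = set(protected)
    keep = set(protected)
    remaining = max(budget - len(keep), 0)
    candidates = [i for i in range(len(scores)) if i not in protected]
    candidates.sort(key=lambda i: scores[i], reverse=True)
    keep.update(candidates[:remaining])
    return sorted(keep)


if __name__ == "__main__":
    # Two frames with two tokens each; token 0 stands in for a protected special token.
    scores = combine_scores([1.0, 0.5], [[0.2, 0.9], [0.8, 0.4]])
    budgets = allocate_layer_budgets([0.9, 0.5], total_budget=12)
    print(budgets)
    print(evict_tokens(scores, budget=2, protected=[0]))
```

In this sketch the protected token survives even though its score is low, which is the behavior the privilege mechanism is described as providing. The multiplicative score and the `1 - similarity` weighting are placeholders chosen for brevity; the actual rules are in the paper and the linked repository.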
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| 3D Reconstruction | 7 Scenes | Accuracy Median | 0.7 | 128 |
| 3D Reconstruction | NRGBD | Accuracy Mean | 4.6 | 63 |
| 3D Reconstruction | Bonn (test) | Abs Rel | 5.4 | 20 |
| 3D Reconstruction | Long3D Classroom | Accuracy (Mean) | 33.2 | 7 |
| 3D Reconstruction | Long3D Library | Acc (Mean) | 0.745 | 7 |
| 3D Reconstruction | Long3D Academic Building | Accuracy (Mean) | 4.325 | 7 |
| 3D Reconstruction | Long3D Dormitory | Accuracy (Mean) | 1.135 | 4 |
| 3D Reconstruction | Long3D Badminton Court (6067 frames) | Mean Accuracy | 1.312 | 4 |