FrameVGGT: Geometry-Aligned Frame-Level Memory for Bounded Streaming VGGT
About
Streaming Visual Geometry Transformers such as StreamVGGT enable strong online 3D perception, but their KV cache grows without bound over long streams, limiting practical deployment. We revisit bounded-memory streaming from the perspective of geometric support. Unlike language modeling, where useful information can often be compressed at the token level, geometry-driven reasoning depends on redundant and mutually compatible multi-view support. Under fixed budgets, token-level retention can fragment within-frame evidence, weaken the coherence of geometric support, and make stable long-horizon inference more difficult. Motivated by this observation, we propose FrameVGGT, a bounded explicit-memory framework that organizes each frame's incremental KV contribution as a coherent frame-level segment. FrameVGGT summarizes each segment with a lightweight key-space prototype and maintains a fixed-capacity memory of complementary segments, with an optional sparse anchor tier for difficult long-horizon intervals. Across long-sequence 3D reconstruction, video depth estimation, and camera pose estimation, FrameVGGT achieves favorable accuracy-memory trade-offs under bounded memory while maintaining more stable geometry over long streams.
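The frame-level memory described above can be illustrated with a minimal sketch. It is not the authors' implementation; it is one possible reading of the abstract under stated assumptions. The prototype is assumed to be the L2-normalised mean of a frame's key vectors, "complementary" is assumed to mean low cosine similarity between prototypes, and the eviction rule (drop the segment most similar to another retained segment) is a hypothetical stand-in. The names `FrameSegment` and `FrameMemory` are illustrative, and the optional anchor tier is omitted.

```python
import math
from dataclasses import dataclass, field


@dataclass
class FrameSegment:
    """All KV entries contributed by one frame, kept or dropped as a unit."""
    frame_id: int
    keys: list      # one key vector (list of floats) per token
    values: list    # one value vector per token
    prototype: list = field(init=False)

    def __post_init__(self):
        # Assumption: the key-space prototype is the normalised mean key.
        dim = len(self.keys[0])
        mean = [sum(k[d] for k in self.keys) / len(self.keys) for d in range(dim)]
        norm = math.sqrt(sum(x * x for x in mean)) or 1.0
        self.prototype = [x / norm for x in mean]


def cosine(a, b):
    # Prototypes are unit-length, so the dot product is the cosine similarity.
    return sum(x * y for x, y in zip(a, b))


class FrameMemory:
    """Fixed-capacity store of whole frame segments (hypothetical policy)."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.segments = []

    def add(self, segment):
        self.segments.append(segment)
        if len(self.segments) > self.capacity:
            self._evict_most_redundant()

    def _evict_most_redundant(self):
        # A segment's redundancy is its highest similarity to any other segment.
        def redundancy(i):
            return max(
                cosine(self.segments[i].prototype, other.prototype)
                for j, other in enumerate(self.segments)
                if j != i
            )

        # Ties resolve to the earliest (oldest) segment.
        victim = max(range(len(self.segments)), key=redundancy)
        del self.segments[victim]


memory = FrameMemory(capacity=2)
memory.add(FrameSegment(0, keys=[[1.0, 0.0]], values=[[0.1]]))
memory.add(FrameSegment(1, keys=[[0.0, 1.0]], values=[[0.2]]))
memory.add(FrameSegment(2, keys=[[1.0, 0.1]], values=[[0.3]]))
retained = [s.frame_id for s in memory.segments]
```

In this toy run, frames 0 and 2 have nearly parallel prototypes, so one of them is dropped in full and the memory keeps two frames with dissimilar prototypes. Because eviction acts on whole segments, every retained frame keeps all of its tokens, which is the property the abstract contrasts with token-level retention.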
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Video Depth Estimation | BONN | AbsRel | 5.12 | 131 |
| 3D Reconstruction | 7 Scenes | Accuracy Median | 1.3 | 128 |
| 3D Reconstruction | Neural RGB-D (NRGBD) | Acc Mean | 0.054 | 88 |
| Camera Pose Estimation | TUM | ATE | 0.0385 | 59 |