
FrameVGGT: Geometry-Aligned Frame-Level Memory for Bounded Streaming VGGT

About

Streaming Visual Geometry Transformers such as StreamVGGT enable strong online 3D perception, but their KV-cache grows unbounded over long streams, limiting practical deployment. We revisit bounded-memory streaming from the perspective of geometric support. Unlike language modeling, where useful information can often be compressed at the token level, geometry-driven reasoning depends on redundant and mutually compatible multi-view support. Under fixed budgets, token-level retention can fragment within-frame evidence, weaken the coherence of geometric support, and make stable long-horizon inference more difficult. Motivated by this observation, we propose FrameVGGT, a bounded explicit-memory framework that organizes each frame's incremental KV contribution as a coherent frame-level segment. FrameVGGT summarizes each segment with a lightweight key-space prototype and maintains a fixed-capacity memory of complementary segments, with an optional sparse anchor tier for difficult long-horizon intervals. Across long-sequence 3D reconstruction, video depth estimation, and camera pose estimation, FrameVGGT achieves favorable accuracy-memory trade-offs under bounded memory while maintaining more stable geometry over long streams.
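The frame-level memory idea described above can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify how prototypes are computed or how complementary segments are selected, so the unit-norm mean key used as the prototype, the cosine-similarity redundancy score, and all names (`FrameSegment`, `FrameMemory`, `mean_key_prototype`) are assumptions made for illustration. The optional anchor tier is omitted.

```python
import math
from dataclasses import dataclass
from typing import List

Vector = List[float]


@dataclass
class FrameSegment:
    """All KV entries contributed by one frame, kept or evicted as a unit."""
    frame_id: int
    keys: List[Vector]
    values: List[Vector]
    prototype: Vector  # assumed summary: unit-norm mean of the frame's keys


def mean_key_prototype(keys: List[Vector]) -> Vector:
    """Assumed prototype: the mean key vector, scaled to unit length."""
    dim = len(keys[0])
    mean = [sum(k[d] for k in keys) / len(keys) for d in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean)) or 1.0
    return [x / norm for x in mean]


def cosine(a: Vector, b: Vector) -> float:
    """Dot product; equals cosine similarity because prototypes are unit-norm."""
    return sum(x * y for x, y in zip(a, b))


class FrameMemory:
    """Fixed-capacity memory that stores whole frame segments, never lone tokens."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.segments: List[FrameSegment] = []

    def add_frame(self, frame_id: int, keys: List[Vector], values: List[Vector]) -> None:
        proto = mean_key_prototype(keys)
        self.segments.append(FrameSegment(frame_id, keys, values, proto))
        if len(self.segments) > self.capacity:
            self._evict_most_redundant()

    def _evict_most_redundant(self) -> None:
        # Assumed heuristic: a segment is redundant if its prototype is close
        # to another stored prototype. On ties, the earliest segment is dropped.
        def redundancy(i: int) -> float:
            return max(
                cosine(self.segments[i].prototype, self.segments[j].prototype)
                for j in range(len(self.segments))
                if j != i
            )

        worst = max(range(len(self.segments)), key=redundancy)
        del self.segments[worst]
```

With a capacity of two, adding two near-duplicate frames followed by a dissimilar one leaves the dissimilar frame in memory and drops one of the duplicates, so the memory stays bounded while each retained frame keeps all of its tokens together.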

Zhisong Xu, Takeshi Oishi • 2026

Related benchmarks

Task                    Dataset               Result                Rank
Video Depth Estimation  BONN                  AbsRel 5.12           131
3D Reconstruction       7 Scenes              Accuracy Median 1.3   128
3D Reconstruction       Neural RGB-D (NRGBD)  Acc Mean 0.054        88
Camera Pose Estimation  TUM                   ATE 0.0385            59