FrameVGGT: Coherence-Preserving Memory for Bounded Streaming Geometry

About

Streaming Visual Geometry Transformers such as StreamVGGT enable strong online 3D perception, but their KV cache grows without bound over long streams, which limits practical deployment. We study bounded-memory streaming geometry from the perspective of memory organization: unlike language modeling, where useful information can often be compressed at the token level, geometry-driven inference relies on coherent and mutually compatible observations across views. Under a fixed memory budget, retaining history as isolated entries can progressively fragment the geometric context needed for stable long-horizon matching and fusion. We therefore propose FrameVGGT, a bounded-memory framework that maintains a fixed-capacity set of complementary memory units for streaming geometry. In our implementation, each unit is instantiated as a frame-wise KV segment summarized by a compact key-space prototype, together with a sparse anchor tier for persistent long-range references. Across long-sequence 3D reconstruction, video depth estimation, and camera pose estimation, FrameVGGT achieves favorable accuracy-memory trade-offs under bounded budgets while maintaining more stable geometry over long streams.
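To make the memory organization described above more concrete, the following is a minimal sketch of a fixed-capacity store of frame-wise KV segments. It is not the authors' implementation. The class name `FrameMemory`, the use of a normalized mean key as the prototype, the redundancy-based eviction rule, and the `anchor` flag are all assumptions chosen for illustration; the abstract does not specify how prototypes are computed, how units are selected, or how anchors are chosen.

```python
import numpy as np


class FrameMemory:
    """Hypothetical bounded store of frame-wise KV segments (illustration only)."""

    def __init__(self, capacity, max_anchors):
        # Anchors are never evicted, so they must leave room for regular units.
        assert 0 <= max_anchors < capacity
        self.capacity = capacity
        self.max_anchors = max_anchors
        self.units = []  # one dict per retained frame

    @staticmethod
    def _prototype(keys):
        # Assumed summary: L2-normalized mean of the frame's key vectors.
        mean = keys.mean(axis=0)
        return mean / (np.linalg.norm(mean) + 1e-8)

    def add(self, frame_id, keys, values, anchor=False):
        # keys, values: arrays of shape (tokens, dim) for a single frame.
        n_anchors = sum(u["anchor"] for u in self.units)
        self.units.append({
            "frame_id": frame_id,
            "keys": keys,
            "values": values,
            "proto": self._prototype(keys),
            # Honor the anchor request only while the anchor tier has room.
            "anchor": bool(anchor) and n_anchors < self.max_anchors,
        })
        if len(self.units) > self.capacity:
            self._evict_most_redundant()

    def _evict_most_redundant(self):
        # Assumed rule: drop the non-anchor frame whose prototype is most
        # similar to another retained prototype, so the frames that remain
        # stay complementary. Whole frames are removed, never single tokens.
        protos = np.stack([u["proto"] for u in self.units])
        sim = protos @ protos.T
        np.fill_diagonal(sim, -np.inf)
        redundancy = sim.max(axis=1)
        for i, unit in enumerate(self.units):
            if unit["anchor"]:
                redundancy[i] = -np.inf
        self.units.pop(int(np.argmax(redundancy)))

    def context(self):
        # Concatenate retained segments into the KV context for attention.
        keys = np.concatenate([u["keys"] for u in self.units], axis=0)
        values = np.concatenate([u["values"] for u in self.units], axis=0)
        return keys, values
```

In this sketch the memory never holds more than `capacity` frames, and eviction operates on whole frames so that each retained segment remains internally coherent. A streaming loop would call `add` once per incoming frame and pass the result of `context()` to attention.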

Zhisong Xu, Takeshi Oishi • 2026

Related benchmarks

Task	Dataset	Result
3D Reconstruction	7 Scenes	Completion 2	161
Video Depth Estimation	BONN	AbsRel 5.12	139
3D Reconstruction	Neural RGB-D (NRGBD)	Acc Mean 0.054	88
Camera Pose Estimation	TUM	ATE 0.0385	65
