Attention Itself Could Retrieve. RetrieveVGGT: Training-Free Long Context Streaming 3D Reconstruction via Query-Key Similarity Retrieval

About

Visual Geometry Grounded Transformer (VGGT) advances 3D reconstruction with a scalable Transformer architecture, but the quadratic complexity of global attention prevents its application to long contexts. StreamVGGT enables streaming with causal attention, yet its KV cache grows linearly with the number of frames, causing memory overflow and quality degradation. We present RetrieveVGGT, a training-free framework that formulates context construction for VGGT as a retrieval problem. By retrieving a fixed number of relevant frames at each step, VGGT keeps a controllable memory budget close to its training context length. Interestingly, we find that the similarity between current-frame queries and cached history-frame keys at the first global attention layer of VGGT is already a strong indicator of relevance, eliminating the need for additional learned scoring. To increase the diversity of the retrieved information, as a recommender system does, we propose Segment Sampling so that the retrieval spans distinct relevant segments rather than a single high-similarity region. We also design a pose-aware spatial memory mechanism that organizes history frames according to their already-estimated camera poses, enabling location-aware retrieval. Extensive experiments demonstrate that RetrieveVGGT achieves state-of-the-art performance, outperforming StreamVGGT, TTT3R, and InfiniteVGGT while maintaining constant memory usage regardless of sequence length. Code is available at https://github.com/zzctmd/RetrieveVGGT.

Zichen Zou, Xiaosong Jia, Zuxuan Wu, Yu-Gang Jiang • 2026
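
The retrieval idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names (`frame_score`, `retrieve_frames`), the mean scaled dot product used as the frame score, the equal-length contiguous segments, and the round-robin selection are all assumptions chosen for illustration. It shows how scoring cached history-frame keys against current-frame queries and then spreading the picks over segments keeps the context at a fixed budget while avoiding a single high-similarity region.

```python
from typing import List, Sequence

Vector = Sequence[float]


def frame_score(queries: Sequence[Vector], keys: Sequence[Vector]) -> float:
    """Mean scaled dot product between all query tokens and all key tokens.

    Assumed relevance score; the paper's exact aggregation is not given here.
    """
    dim = len(queries[0])
    total = 0.0
    for q in queries:
        for k in keys:
            total += sum(a * b for a, b in zip(q, k))
    return total / (len(queries) * len(keys) * dim ** 0.5)


def retrieve_frames(
    queries: Sequence[Vector],
    cached_keys: Sequence[Sequence[Vector]],
    budget: int,
    num_segments: int,
) -> List[int]:
    """Return at most `budget` history frame indices, spread over segments."""
    n = len(cached_keys)
    if n == 0 or budget <= 0:
        return []
    target = min(budget, n)
    scores = [frame_score(queries, keys) for keys in cached_keys]

    # Hypothetical segmentation: contiguous, near-equal slices of the history.
    num_segments = max(1, min(num_segments, n))
    bounds = [i * n // num_segments for i in range(num_segments + 1)]
    segments = [
        sorted(range(bounds[i], bounds[i + 1]), key=lambda j: -scores[j])
        for i in range(num_segments)
    ]

    # Round-robin: each pass offers the next-best frame of every segment,
    # and the highest-scoring offers are taken until the budget is filled.
    picked: List[int] = []
    rank = 0
    while len(picked) < target:
        offers = [seg[rank] for seg in segments if rank < len(seg)]
        offers.sort(key=lambda j: -scores[j])
        picked.extend(offers[: target - len(picked)])
        rank += 1
    return sorted(picked)
```

With a single query token `[[1.0]]` and six cached frames whose one-token keys are 0.9, 0.8, 0.1, 0.2, 0.7 and 0.0, a budget of 3 with one segment returns `[0, 1, 4]` (plain top-3), while three segments return `[0, 3, 4]`, trading the second-best frame for coverage of the middle of the sequence. The number of retrieved frames never exceeds the budget, whatever the history length.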

Related benchmarks

Task	Dataset	Result
3D Reconstruction	7 Scenes	Completion 2.52	161
Depth Estimation	KITTI 110 frames	AbsRel 17.54	75
Depth Estimation	Sintel ~50 frames	AbsRel 0.325	70
Video Depth Estimation	Bonn 110 frames	AbsRel 5.7	69
3D Reconstruction	NRGBD	Normalized Score (NC) 66.49	66
Video Depth Estimation	Bonn 400 frames	Abs Rel 0.0699	15
Video Depth Estimation	Bonn 300 frames	Abs Rel 0.0698	13
Video Depth Estimation	Bonn 500 frames	Abs Rel 0.0669	9
Video Depth Estimation	Bonn 200 frames	Abs Rel 0.0604	6

