VGG-T$^3$: Offline Feed-Forward 3D Reconstruction at Scale

About

We present a scalable 3D reconstruction model that addresses a critical limitation in offline feed-forward methods: their computational and memory requirements grow quadratically w.r.t. the number of input images. Our approach is built on the key insight that this bottleneck stems from the varying-length Key-Value (KV) space representation of scene geometry, which we distill into a fixed-size Multi-Layer Perceptron (MLP) via test-time training. VGG-T$^3$ (Visual Geometry Grounded Test Time Training) scales linearly w.r.t. the number of input views, similar to online models, and reconstructs a $1k$-image collection in just $54$ seconds, achieving an $11.6\times$ speed-up over baselines that rely on softmax attention. Since our method retains global scene aggregation capability, it outperforms other linear-time methods by large margins in point map reconstruction error. Finally, we demonstrate visual localization capabilities of our model by querying the scene representation with unseen images.
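
As a rough illustration of the idea of distilling a variable-length set of key-value pairs into a fixed-size MLP via test-time training, the following is a minimal NumPy sketch. It is a toy under stated assumptions, not the architecture or training procedure of VGG-T$^3$: the function names (`fit_kv_mlp`, `query`), the one-hidden-layer network, the mean-squared-error objective, and all hyperparameters are hypothetical choices made for this example.

```python
import numpy as np

def fit_kv_mlp(keys, values, hidden=64, steps=300, lr=0.05, seed=0):
    """Fit a one-hidden-layer MLP so that mlp(key) approximates value.

    Toy stand-in for test-time training: the (keys, values) set may have
    any length n, but the learned parameters have a fixed size.
    """
    rng = np.random.default_rng(seed)
    d_k, d_v = keys.shape[1], values.shape[1]
    W1 = rng.normal(0.0, 1.0 / np.sqrt(d_k), (d_k, hidden))
    b1 = np.zeros(hidden)
    W2 = rng.normal(0.0, 1.0 / np.sqrt(hidden), (hidden, d_v))
    b2 = np.zeros(d_v)
    losses = []
    for _ in range(steps):
        # Forward pass over all key-value pairs (cost linear in n).
        h = np.tanh(keys @ W1 + b1)
        err = h @ W2 + b2 - values
        losses.append(float(np.mean(err ** 2)))
        # Backward pass for the mean-squared-error loss.
        g_pred = 2.0 * err / err.size
        g_W2 = h.T @ g_pred
        g_b2 = g_pred.sum(axis=0)
        g_pre = (g_pred @ W2.T) * (1.0 - h ** 2)
        g_W1 = keys.T @ g_pre
        g_b1 = g_pre.sum(axis=0)
        # Plain gradient-descent update.
        W1 -= lr * g_W1
        b1 -= lr * g_b1
        W2 -= lr * g_W2
        b2 -= lr * g_b2
    return (W1, b1, W2, b2), losses

def query(params, q):
    """Read from the fixed-size representation; cost is independent of n."""
    W1, b1, W2, b2 = params
    return np.tanh(q @ W1 + b1) @ W2 + b2
```

In this sketch, fitting touches each key-value pair once per step, so the cost grows linearly with the number of pairs, and a query only evaluates the MLP, whose size does not depend on how many pairs were seen. Softmax attention, by contrast, compares every query against every stored key.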

Sven Elflein, Ruilong Li, Sérgio Agostinho, Zan Gojcic, Laura Leal-Taixé, Qunjie Zhou, Aljosa Osep • 2026

Related benchmarks

Task	Dataset	Metric	Result	Rank
Video Depth Estimation	Sintel	Delta Threshold Accuracy (1.25)	58.1	235
Video Depth Estimation	BONN	AbsRel	6.3	139

