
ZipMap: Linear-Time Stateful 3D Reconstruction via Test-Time Training

About

Feed-forward transformer models have driven rapid progress in 3D vision, but state-of-the-art methods such as VGGT and $\pi^3$ have a computational cost that scales quadratically with the number of input images, making them inefficient when applied to large image collections. Sequential-reconstruction approaches reduce this cost but sacrifice reconstruction quality. We introduce ZipMap, a stateful feed-forward model that achieves linear-time, bidirectional 3D reconstruction while matching or surpassing the accuracy of quadratic-time methods. ZipMap employs test-time training layers to zip an entire image collection into a compact hidden scene state in a single forward pass, enabling reconstruction of over 700 frames in under 10 seconds on a single H100 GPU, more than $20\times$ faster than state-of-the-art methods such as VGGT. Moreover, we demonstrate the benefits of having a stateful representation in real-time scene-state querying and its extension to sequential streaming reconstruction.

Haian Jin, Rundi Wu, Tianyuan Zhang, Ruiqi Gao, Jonathan T. Barron, Noah Snavely, Aleksander Holynski • 2026
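To make the abstract's linear-time claim concrete, here is a minimal sketch of the general idea behind a test-time-training (fast-weight) layer: tokens are compressed into a fixed-size state by gradient steps, and that state is then queried. This is not the ZipMap architecture or the authors' code. The function names `zip_tokens` and `query_state`, the linear fast weight `W`, the squared-error objective, and the parameters `lr` and `steps` are all assumptions chosen for illustration.

```python
import numpy as np


def zip_tokens(keys, values, lr=0.1, steps=1):
    """Compress (key, value) token pairs into a fixed-size fast-weight matrix.

    Hypothetical illustration: W is updated by plain gradient descent on the
    mean squared error between keys @ W and values. Each step costs
    O(n * d_k * d_v), i.e. linear in the number of tokens n.
    """
    n, d_k = keys.shape
    d_v = values.shape[1]
    W = np.zeros((d_k, d_v))  # state size does not depend on n
    for _ in range(steps):
        pred = keys @ W                       # (n, d_v)
        grad = keys.T @ (pred - values) / n   # (d_k, d_v)
        W -= lr * grad
    return W


def query_state(W, queries):
    """Read from the compressed state; cost is linear in the number of queries."""
    return queries @ W
```

In this sketch the state `W` has the same shape whether 50 or 700 tokens are compressed, which is why both the update and later queries scale linearly with the number of tokens, in contrast to pairwise attention over all tokens, which scales quadratically.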

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Monocular Depth Estimation | KITTI | Abs Rel | 0.063 | 203 |
| Video Depth Estimation | Sintel | Delta Threshold Accuracy (1.25) | 73.1 | 193 |
| Camera pose estimation | Sintel | ATE | 0.132 | 192 |
| Camera pose estimation | TUM-dynamic | ATE | 0.012 | 163 |
| Monocular Depth Estimation | NYU V2 | -- | -- | 131 |
| Video Depth Estimation | KITTI | Abs Rel | 0.05 | 126 |
| Video Depth Estimation | BONN | AbsRel | 5.2 | 116 |
| Monocular Depth Estimation | Sintel | Abs Rel | 0.268 | 91 |
| Camera pose estimation | CO3D v2 | AUC@30 | 88.76 | 78 |
| Point Cloud Reconstruction | ETH3D and DTU | Reconstruction Time (s) | 0.125 | 50 |
Showing 10 of 22 rows
