
ZipMap: Linear-Time Stateful 3D Reconstruction via Test-Time Training

About

Feed-forward transformer models have driven rapid progress in 3D vision, but state-of-the-art methods such as VGGT and $\pi^3$ have a computational cost that scales quadratically with the number of input images, making them inefficient when applied to large image collections. Sequential-reconstruction approaches reduce this cost but sacrifice reconstruction quality. We introduce ZipMap, a stateful feed-forward model that achieves linear-time, bidirectional 3D reconstruction while matching or surpassing the accuracy of quadratic-time methods. ZipMap employs test-time training layers to zip an entire image collection into a compact hidden scene state in a single forward pass, enabling reconstruction of over 700 frames in under 10 seconds on a single H100 GPU, more than $20\times$ faster than state-of-the-art methods such as VGGT. Moreover, we demonstrate the benefits of having a stateful representation in real-time scene-state querying and its extension to sequential streaming reconstruction.

Haian Jin, Rundi Wu, Tianyuan Zhang, Ruiqi Gao, Jonathan T. Barron, Noah Snavely, Aleksander Holynski • 2026
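To make the abstract's "compact hidden scene state" more concrete, here is a minimal sketch of a generic test-time training (fast-weight) update. It is not ZipMap's actual layer, which the abstract does not specify. The functions `ttt_compress` and `query_state`, the key/value token layout, and the learning rate are all assumptions chosen for illustration. Each token triggers one gradient step on a small weight matrix, so the cost grows linearly with the number of tokens and the state size stays fixed.

```python
def ttt_compress(pairs, dim, lr=1.0):
    """Fold (key, value) pairs into a fixed-size dim x dim state W.

    Hypothetical illustration: one gradient step per pair on the loss
    0.5 * ||W @ key - value||^2, so the work is linear in len(pairs).
    """
    W = [[0.0] * dim for _ in range(dim)]
    for key, value in pairs:
        # Current prediction of the state for this key: W @ key.
        pred = [sum(W[i][j] * key[j] for j in range(dim)) for i in range(dim)]
        err = [pred[i] - value[i] for i in range(dim)]
        # The gradient of the loss with respect to W is outer(err, key).
        for i in range(dim):
            for j in range(dim):
                W[i][j] -= lr * err[i] * key[j]
    return W


def query_state(W, query):
    """Read from the compressed state without revisiting the inputs."""
    return [sum(w * q for w, q in zip(row, query)) for row in W]


# Two tokens with orthonormal keys; the state recalls each value exactly.
state = ttt_compress(
    [([1.0, 0.0], [2.0, 3.0]), ([0.0, 1.0], [5.0, 7.0])], dim=2
)
recalled = query_state(state, [1.0, 0.0])  # [2.0, 3.0]
```

Once the state is built, a query touches only `W`, so its cost does not depend on how many inputs were folded in. This matches the abstract's distinction between a one-off linear pass and cheap scene-state querying afterwards. With non-orthogonal keys or a smaller learning rate, recall is only approximate.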

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Video Depth Estimation | Sintel | Delta Threshold Accuracy (1.25) | 73.1 | 235 |
| Monocular Depth Estimation | KITTI | Abs Rel | 0.063 | 220 |
| Camera pose estimation | TUM-dynamic | ATE | 0.012 | 205 |
| Camera pose estimation | Sintel | ATE | 0.132 | 203 |
| Monocular Depth Estimation | NYU V2 | -- | -- | 174 |
| Video Depth Estimation | KITTI | Abs Rel | 0.05 | 148 |
| Video Depth Estimation | BONN | AbsRel | 5.2 | 131 |
| Monocular Depth Estimation | Sintel | Abs Rel | 0.268 | 127 |
| Camera pose estimation | CO3D v2 | AUC@30 | 88.76 | 117 |
| Monocular Depth Estimation | BONN | Delta 1.25 Accuracy | 97.3 | 60 |
Showing 10 of 30 rows
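For readers unfamiliar with the depth metrics listed above, the following sketch computes the two most common ones using their standard definitions: absolute relative error (Abs Rel) and threshold accuracy at 1.25 (Delta 1.25). The function names and the flat-list input format are assumptions for illustration. Real benchmark protocols also differ in masking, depth caps, and scale alignment, which this sketch omits.

```python
def abs_rel(pred, gt):
    """Mean of |pred - gt| / gt over pixels with valid (positive) ground truth."""
    valid = [(p, g) for p, g in zip(pred, gt) if g > 0]
    return sum(abs(p - g) / g for p, g in valid) / len(valid)


def delta_accuracy(pred, gt, threshold=1.25):
    """Fraction of valid pixels where max(pred/gt, gt/pred) < threshold."""
    valid = [(p, g) for p, g in zip(pred, gt) if g > 0]
    # A non-positive prediction cannot satisfy the ratio test; count it as a miss.
    hits = sum(1 for p, g in valid if p > 0 and max(p / g, g / p) < threshold)
    return hits / len(valid)


# Three pixels: two exact predictions and one that is twice too deep.
pred = [1.0, 2.0, 4.0]
gt = [1.0, 2.0, 2.0]
error = abs_rel(pred, gt)            # (0 + 0 + 1) / 3
accuracy = delta_accuracy(pred, gt)  # 2 of 3 pixels pass
```

Both functions return fractions. Listings that show values such as 73.1 or 97.3 for Delta 1.25 are reporting the same quantity as a percentage.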
