ZipMap: Linear-Time Stateful 3D Reconstruction via Test-Time Training

About

Feed-forward transformer models have driven rapid progress in 3D vision, but state-of-the-art methods such as VGGT and $\pi^3$ have a computational cost that scales quadratically with the number of input images, making them inefficient when applied to large image collections. Sequential-reconstruction approaches reduce this cost but sacrifice reconstruction quality. We introduce ZipMap, a stateful feed-forward model that achieves linear-time, bidirectional 3D reconstruction while matching or surpassing the accuracy of quadratic-time methods. ZipMap employs test-time training layers to zip an entire image collection into a compact hidden scene state in a single forward pass, enabling reconstruction of over 700 frames in under 10 seconds on a single H100 GPU, more than $20\times$ faster than state-of-the-art methods such as VGGT. Moreover, we demonstrate the benefits of having a stateful representation in real-time scene-state querying and its extension to sequential streaming reconstruction.

Haian Jin, Rundi Wu, Tianyuan Zhang, Ruiqi Gao, Jonathan T. Barron, Noah Snavely, Aleksander Holynski • 2026
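
The abstract does not spell out how a test-time training layer compresses a collection into a state, so the following is only a minimal sketch of the general idea, not ZipMap's actual layer. It assumes a linear fast-weight matrix, a squared-error reconstruction loss, and a single gradient step taken over all tokens at once; the names `ttt_zip`, `ttt_query`, and `recon_loss` are hypothetical. The point it illustrates is that the state has a fixed size and the update costs O(n d^2), which is linear in the number of tokens n.

```python
import numpy as np

def recon_loss(keys, values, state):
    """Mean squared reconstruction error of the linear fast-weight model."""
    return 0.5 * np.mean(np.sum((keys @ state - values) ** 2, axis=1))

def ttt_zip(keys, values, w0, lr=0.1):
    """Compress n tokens into a (d, d) state with one gradient step.

    Assumption: a linear model keys @ W approximates values. The gradient
    is taken over all tokens together, so every token contributes to the
    state regardless of its position (a bidirectional update).
    """
    n = keys.shape[0]
    residual = keys @ w0 - values      # (n, d)
    grad = keys.T @ residual / n       # (d, d), cost O(n * d^2)
    return w0 - lr * grad

def ttt_query(queries, state):
    """Read from the compact state; cost is independent of n."""
    return queries @ state
```

With a zero initial state, this single step reduces to `lr * keys.T @ values / n`, the same fixed-size summary that linear attention accumulates. Once the state is built, any number of queries can be answered without revisiting the input tokens.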

Related benchmarks

Task	Dataset	Metric	Value	(unlabeled)
Video Depth Estimation	Sintel	Delta Threshold Accuracy (1.25)	73.1	235
Monocular Depth Estimation	KITTI	Abs Rel	0.063	220
Camera pose estimation	TUM-dynamic	ATE	0.012	205
Camera pose estimation	Sintel	ATE	0.132	203
Monocular Depth Estimation	NYU V2	--	--	192
Video Depth Estimation	KITTI	Abs Rel	0.05	153
Monocular Depth Estimation	Sintel	Abs Rel	0.268	142
Video Depth Estimation	BONN	AbsRel	5.2	139
Camera pose estimation	CO3D v2	AUC@30	88.76	132
Monocular Depth Estimation	BONN	Delta 1.25 Accuracy	97.3	60

Showing 10 of 30 rows
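
For readers unfamiliar with the depth metrics in the table, here is a minimal sketch of the standard definitions of absolute relative error (Abs Rel) and delta threshold accuracy at 1.25. The function name `depth_metrics` is hypothetical, and the sketch assumes predictions are already on the same scale as the ground truth; benchmark protocols often apply a scale or scale-and-shift alignment first, which this page does not describe. The accuracy is returned as a fraction, whereas the table appears to report percentages.

```python
import numpy as np

def depth_metrics(pred, gt, threshold=1.25):
    """Return (abs_rel, delta_accuracy) over pixels with valid depth.

    Assumption: pixels with non-positive ground truth or prediction
    are treated as invalid and ignored.
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    valid = (gt > 0) & (pred > 0)
    pred, gt = pred[valid], gt[valid]

    # Abs Rel: mean of |pred - gt| / gt
    abs_rel = float(np.mean(np.abs(pred - gt) / gt))

    # Delta accuracy: fraction of pixels with max(pred/gt, gt/pred) < threshold
    ratio = np.maximum(pred / gt, gt / pred)
    delta = float(np.mean(ratio < threshold))
    return abs_rel, delta
```

Lower is better for Abs Rel, and higher is better for delta accuracy.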
