
Continuous 3D Perception Model with Persistent State

About

We present a unified framework capable of solving a broad range of 3D tasks. Our approach features a stateful recurrent model that continuously updates its state representation with each new observation. Given a stream of images, this evolving state can be used to generate metric-scale pointmaps (per-pixel 3D points) for each new input in an online fashion. These pointmaps reside within a common coordinate system, and can be accumulated into a coherent, dense scene reconstruction that updates as new images arrive. Our model, called CUT3R (Continuous Updating Transformer for 3D Reconstruction), captures rich priors of real-world scenes: not only can it predict accurate pointmaps from image observations, but it can also infer unseen regions of the scene by probing at virtual, unobserved views. Our method is simple yet highly flexible, naturally accepting varying lengths of images that may be either video streams or unordered photo collections, containing both static and dynamic content. We evaluate our method on various 3D/4D tasks and demonstrate competitive or state-of-the-art performance in each. Project Page: https://cut3r.github.io/
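The online loop described above — consume one frame at a time, update a persistent state, emit a pointmap in a shared coordinate frame, and accumulate a growing reconstruction — can be sketched as follows. This is a toy illustration only: the class name, methods, and placeholder predictions are all hypothetical, not CUT3R's actual API, and the "state" here is a trivial stand-in for the model's recurrent transformer state.

```python
import numpy as np

class OnlineReconstructor:
    """Toy sketch of the online inference loop from the abstract.

    Hypothetical interface: in the real model, a transformer updates an
    internal state with each observation and predicts metric-scale
    pointmaps; here the "prediction" is a placeholder and the state is
    a simple list.
    """

    def __init__(self):
        self.state = []          # stand-in for the recurrent model state
        self.scene_points = []   # accumulated dense reconstruction

    def update(self, image: np.ndarray) -> np.ndarray:
        """Consume one observation, return a per-pixel pointmap (H, W, 3)."""
        h, w = image.shape[:2]
        # Placeholder: the real model predicts 3D points per pixel,
        # all expressed in one common coordinate system.
        pointmap = np.zeros((h, w, 3), dtype=np.float32)
        self.state.append(image.mean())         # pretend state update
        self.scene_points.append(pointmap.reshape(-1, 3))
        return pointmap

    def reconstruction(self) -> np.ndarray:
        """All accumulated points in the common coordinate system."""
        return np.concatenate(self.scene_points, axis=0)

# Feed a short "stream" of frames one at a time.
rec = OnlineReconstructor()
for _ in range(3):
    rec.update(np.random.rand(4, 5, 3))
points = rec.reconstruction()
print(points.shape)  # (60, 3): 3 frames of 4x5 pixels each
```

The key property this sketch mirrors is that each frame is processed once, in order, and the reconstruction grows incrementally rather than being re-solved over the whole image set.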

Qianqian Wang, Yifei Zhang, Aleksander Holynski, Alexei A. Efros, Angjoo Kanazawa · 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Monocular Depth Estimation | KITTI | Abs Rel | 0.097 | 161 |
| Monocular Depth Estimation | ETH3D | Abs Rel | 4.69 | 117 |
| Monocular Depth Estimation | NYU V2 | Delta 1 Acc | 97.9 | 113 |
| Video Depth Estimation | Sintel | Relative Error (Rel) | 0.417 | 109 |
| Video Depth Estimation | BONN | Relative Error (Rel) | 0.072 | 103 |
| Monocular Depth Estimation | DIODE | Abs Rel | 5.93 | 93 |
| Camera Pose Estimation | Sintel | ATE | 0.213 | 92 |
| Camera Pose Estimation | ScanNet | ATE RMSE (Avg.) | 0.094 | 61 |
| Camera Pose Estimation | TUM dynamics | RRE | 0.451 | 57 |
| Video Depth Estimation | Sintel (test) | Delta 1 Accuracy | 56 | 57 |
Showing 10 of 95 benchmark rows.
