TTT3R: 3D Reconstruction as Test-Time Training
About
Modern recurrent neural networks have become a competitive architecture for 3D reconstruction due to their linear-time complexity. However, their performance degrades significantly when applied beyond the training context length, revealing limited length generalization. In this work, we revisit 3D reconstruction foundation models from a Test-Time Training perspective, framing their design as an online learning problem. Building on this perspective, we leverage the alignment confidence between the memory state and incoming observations to derive a closed-form learning rate for memory updates, balancing the retention of historical information against adaptation to new observations. This training-free intervention, termed TTT3R, substantially improves length generalization, achieving a $2\times$ improvement in global pose estimation over baselines, while operating at 20 FPS with just 6 GB of GPU memory to process thousands of images. Code available at https://rover-xingyu.github.io/TTT3R
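The confidence-weighted memory update described above can be sketched as a simple online running average whose learning rate is derived in closed form from an alignment score. This is a minimal illustrative sketch, not the paper's implementation: the cosine-based confidence proxy, the function names, and the `eta = c / (count + c)` closed form are assumptions chosen to show the retain-vs-adapt trade-off.

```python
import numpy as np

def alignment_confidence(state, obs):
    """Illustrative proxy for alignment confidence: cosine similarity between
    the memory state and the incoming observation, mapped to [0, 1]."""
    num = float(state @ obs)
    denom = np.linalg.norm(state) * np.linalg.norm(obs) + 1e-8
    return 0.5 * (1.0 + num / denom)

def update_memory(state, obs, count):
    """Confidence-weighted online update (hypothetical closed form).

    A well-aligned observation gets a larger learning rate; eta = c / (count + c)
    behaves like a running average whose effective sample size grows with the
    accumulated confidence, so old information is never fully overwritten."""
    c = alignment_confidence(state, obs)
    eta = c / (count + c)                      # closed-form learning rate
    new_state = (1.0 - eta) * state + eta * obs
    return new_state, count + c                # accumulate effective count

# Toy stream: the memory converges toward the mean of aligned observations.
rng = np.random.default_rng(0)
state, count = rng.normal(size=4), 1.0
for _ in range(100):
    obs = np.array([1.0, 0.0, 0.0, 0.0]) + 0.01 * rng.normal(size=4)
    state, count = update_memory(state, obs, count)
```

Because `eta` shrinks as confidence accumulates, later observations perturb the memory less, which is the training-free mechanism that prevents the state from being dominated by the most recent frames.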
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Video Depth Estimation | BONN | Relative Error (Rel) | 0.068 | 103 |
| Camera Pose Estimation | Sintel | ATE | 0.201 | 92 |
| Camera Pose Estimation | ScanNet | ATE RMSE (Avg.) | 0.064 | 61 |
| Camera Pose Estimation | TUM dynamics | RRE | 0.38 | 57 |
| Video Depth Estimation | Sintel (test) | Delta 1 Accuracy | 50 | 57 |
| Camera Localization | 7 Scenes | Average Position Error (m) | 0.143 | 46 |
| 3D Reconstruction | Neural RGB-D (NRGBD) | Acc Mean | 0.165 | 38 |
| Video Depth Estimation | Bonn (test) | Abs Rel | 0.068 | 37 |
| Object Tracking | Arctic Dataset | ATE RMSE (m) | 0.156 | 33 |
| 3D Reconstruction | 7 Scenes | Accuracy Mean | 6.2 | 32 |