WinT3R: Window-Based Streaming Reconstruction with Camera Token Pool
About
We present WinT3R, a feed-forward reconstruction model capable of online prediction of precise camera poses and high-quality point maps. Previous methods suffer from a trade-off between reconstruction quality and real-time performance. To address this, we first introduce a sliding window mechanism that ensures sufficient information exchange among frames within the window, improving the quality of geometric predictions without a large increase in computation. In addition, we use a compact representation of cameras and maintain a global camera token pool, which makes camera pose estimation more reliable without sacrificing efficiency. Extensive experiments on diverse datasets show that these designs enable WinT3R to achieve state-of-the-art online reconstruction quality, camera pose estimation accuracy, and reconstruction speed. Code and model are publicly available at https://github.com/LiZizun/WinT3R.
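The data flow described above (frames processed in sliding windows, with one compact camera token per frame kept in a global pool) can be illustrated with a minimal sketch. This is not the released WinT3R code. The names `sliding_windows`, `CameraTokenPool`, `toy_camera_token`, and `run_stream` are hypothetical, the window size and stride are assumed values, and the "token" is a stand-in scalar summary rather than a learned embedding.

```python
from collections import deque
from typing import Dict, Iterable, Iterator, List, Sequence, Tuple


def sliding_windows(frames: Iterable, window_size: int = 4, stride: int = 2) -> Iterator[list]:
    """Yield overlapping windows from a stream of frames.

    A window is emitted once the buffer is full and at least `stride`
    new frames have arrived since the previous window. Trailing frames
    that do not complete a stride are not emitted in this sketch.
    """
    buf = deque(maxlen=window_size)
    new_frames = 0
    for frame in frames:
        buf.append(frame)
        new_frames += 1
        if len(buf) == window_size and new_frames >= stride:
            yield list(buf)
            new_frames = 0


def toy_camera_token(frame: Sequence[float]) -> Tuple[float]:
    """Hypothetical stand-in for a learned camera token: the frame mean."""
    return (sum(frame) / len(frame),)


class CameraTokenPool:
    """Global pool holding one compact token per frame seen so far."""

    def __init__(self) -> None:
        self._tokens: Dict[int, Tuple[float]] = {}

    def add(self, frame_index: int, token: Tuple[float]) -> None:
        # Frames shared by overlapping windows are stored only once.
        self._tokens.setdefault(frame_index, token)

    def tokens(self) -> List[Tuple[float]]:
        return [self._tokens[i] for i in sorted(self._tokens)]

    def __len__(self) -> int:
        return len(self._tokens)


def run_stream(frames: List[Sequence[float]], window_size: int = 4, stride: int = 2):
    """Process frames window by window and record the pool size per window."""
    pool = CameraTokenPool()
    pool_sizes = []
    for window in sliding_windows(enumerate(frames), window_size, stride):
        for index, frame in window:
            pool.add(index, toy_camera_token(frame))
        # A real model would predict poses here, using the window's frames
        # together with all tokens in the pool as global context.
        pool_sizes.append(len(pool))
    return pool, pool_sizes
```

Under these assumptions, each frame interacts densely only with the few frames in its window, while the pool grows by one small token per frame. That is one way to read the abstract's claim that global camera context is kept "without sacrificing efficiency".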
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Video Depth Estimation | Sintel | Delta Threshold Accuracy (1.25) | 53.7 | 193 |
| Camera pose estimation | Sintel | ATE | 0.225 | 192 |
| Camera pose estimation | TUM-dynamic | ATE | 0.07 | 163 |
| Video Depth Estimation | KITTI | Abs Rel | 0.201 | 126 |
| Camera pose estimation | ScanNet | RPE (t) | 0.02 | 119 |
| Video Depth Estimation | BONN | AbsRel | 7.1 | 116 |
| 3D Reconstruction | 7 Scenes | -- | -- | 94 |
| Video Depth Estimation | Sintel (test) | Delta 1 Accuracy | 50.6 | 61 |
| Camera pose estimation | TUM | ATE | 0.074 | 55 |
| Video Depth Estimation | TUM dynamics | Abs Rel | 0.177 | 53 |
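
For readers unfamiliar with the depth metrics in the table, the sketch below shows how absolute relative error (Abs Rel) and delta threshold accuracy at 1.25 are commonly defined. It assumes the standard definitions and is not the evaluation code behind these numbers. Any scale or shift alignment applied before scoring is not specified here and is left out. The function names `abs_rel` and `delta_accuracy` are illustrative. Both functions return fractions, whereas the accuracy values in the table appear to be percentages.

```python
from typing import Sequence


def abs_rel(pred: Sequence[float], gt: Sequence[float]) -> float:
    """Mean of |pred - gt| / gt over pixels with valid (positive) ground truth."""
    valid = [(p, g) for p, g in zip(pred, gt) if g > 0]
    return sum(abs(p - g) / g for p, g in valid) / len(valid)


def delta_accuracy(pred: Sequence[float], gt: Sequence[float], threshold: float = 1.25) -> float:
    """Fraction of valid pixels where max(pred/gt, gt/pred) < threshold."""
    valid = [(p, g) for p, g in zip(pred, gt) if g > 0]
    # Non-positive predictions are counted as misses.
    hits = sum(1 for p, g in valid if p > 0 and max(p / g, g / p) < threshold)
    return hits / len(valid)
```

Lower is better for Abs Rel and higher is better for delta accuracy.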