WinT3R: Window-Based Streaming Reconstruction with Camera Token Pool
About
We present WinT3R, a feed-forward reconstruction model capable of online prediction of precise camera poses and high-quality point maps. Previous methods suffer from a trade-off between reconstruction quality and real-time performance. To address this, we first introduce a sliding window mechanism that ensures sufficient information exchange among frames within the window, thereby improving the quality of geometric predictions without a large increase in computation. In addition, we leverage a compact representation of cameras and maintain a global camera token pool, which enhances the reliability of camera pose estimation without sacrificing efficiency. These designs enable WinT3R to achieve state-of-the-art performance in terms of online reconstruction quality, camera pose estimation, and reconstruction speed, as validated by extensive experiments on diverse datasets. Code and model are publicly available at https://github.com/LiZizun/WinT3R.
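The abstract does not specify how windows are formed or what a camera token contains, so the following is only a minimal Python sketch of the general idea, not the WinT3R implementation. It assumes fixed-size overlapping windows and one compact token per frame; `sliding_windows`, `CameraTokenPool`, `run_stream`, the `summarize` callback, and the window size and stride values are all hypothetical names and parameters chosen for illustration.

```python
from typing import Callable, Dict, Iterator, List, Sequence, Tuple


def sliding_windows(n_frames: int, size: int, stride: int) -> Iterator[range]:
    """Yield index ranges that cover frames 0..n_frames-1 window by window.

    Overlap (stride < size) is an assumption of this sketch.
    """
    if size <= 0 or stride <= 0:
        raise ValueError("size and stride must be positive")
    start = 0
    while start < n_frames:
        yield range(start, min(start + size, n_frames))
        if start + size >= n_frames:
            break  # the last frame is already covered
        start += stride


class CameraTokenPool:
    """Hypothetical global store holding one compact token per frame."""

    def __init__(self) -> None:
        self._tokens: Dict[int, object] = {}

    def add(self, frame_idx: int, token: object) -> None:
        # Keyed by frame index so frames shared by overlapping
        # windows are stored once (the latest token wins).
        self._tokens[frame_idx] = token

    def tokens(self) -> List[object]:
        """Return all stored tokens in frame order."""
        return [self._tokens[i] for i in sorted(self._tokens)]

    def __len__(self) -> int:
        return len(self._tokens)


def run_stream(
    frames: Sequence[object],
    summarize: Callable[[object], object],
    size: int = 4,
    stride: int = 2,
) -> Tuple[CameraTokenPool, List[int]]:
    """Process frames window by window while growing a global token pool.

    `summarize` stands in for whatever produces a compact camera token.
    Returns the pool and the pool size each window could have consulted.
    """
    pool = CameraTokenPool()
    context_sizes: List[int] = []
    for window in sliding_windows(len(frames), size, stride):
        # A real model could let the window attend to these past tokens.
        context_sizes.append(len(pool.tokens()))
        for i in window:
            pool.add(i, summarize(frames[i]))
    return pool, context_sizes
```

In this sketch the per-window cost depends only on the window size, while the pool grows by one small token per frame, which is one way to picture how global context can be kept without reprocessing earlier frames.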
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Camera pose estimation | Sintel | ATE | 0.225 | 92 |
| Camera pose estimation | ScanNet | ATE RMSE (Avg.) | 0.062 | 61 |
| Video Depth Estimation | Sintel (test) | Delta 1 Accuracy | 50.6 | 57 |
| Video Depth Estimation | Bonn (test) | Abs Rel | 0.07 | 37 |
| Video Depth Estimation | KITTI (test) | Delta1 | 94.9 | 25 |
| Camera pose estimation | TUM | ATE | 0.074 | 13 |
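For readers unfamiliar with the metrics in the table, the sketch below shows their commonly used definitions in plain Python. It is illustrative only: the function names are hypothetical, and the alignment and scaling steps that benchmarks usually apply before scoring (for example, aligning the predicted trajectory to ground truth before computing ATE) are not described on this page and are omitted here.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def abs_rel(pred: Sequence[float], gt: Sequence[float]) -> float:
    """Mean absolute relative depth error: mean(|pred - gt| / gt)."""
    return sum(abs(p - g) / g for p, g in zip(pred, gt)) / len(gt)


def delta1(pred: Sequence[float], gt: Sequence[float], thresh: float = 1.25) -> float:
    """Fraction of depths with max(pred/gt, gt/pred) below `thresh`.

    Returned as a fraction in [0, 1]; multiply by 100 to compare with
    percentage-style values.
    """
    hits = sum(1 for p, g in zip(pred, gt) if max(p / g, g / p) < thresh)
    return hits / len(gt)


def ate_rmse(pred: Sequence[Point], gt: Sequence[Point]) -> float:
    """Root-mean-square distance between matching camera positions.

    Assumes the two trajectories are already expressed in the same frame.
    """
    sq = [sum((a - b) ** 2 for a, b in zip(p, g)) for p, g in zip(pred, gt)]
    return math.sqrt(sum(sq) / len(sq))
```

Lower is better for Abs Rel and ATE, while higher is better for Delta 1.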