Efficiently Reconstructing Dynamic Scenes One D4RT at a Time
About
Understanding and reconstructing the complex geometry and motion of dynamic scenes from video remains a formidable challenge in computer vision. This paper introduces D4RT, a simple yet powerful feedforward model designed to efficiently solve this task. D4RT utilizes a unified transformer architecture to jointly infer depth, spatio-temporal correspondence, and full camera parameters from a single video. Its core innovation is a novel querying mechanism that sidesteps the heavy computation of dense, per-frame decoding and the complexity of managing multiple, task-specific decoders. Our decoding interface allows the model to independently and flexibly probe the 3D position of any point in space and time. The result is a lightweight and highly scalable method that enables remarkably efficient training and inference. We demonstrate that our approach sets a new state of the art, outperforming previous methods across a wide spectrum of 4D reconstruction tasks. We refer to the project webpage for animated results: https://d4rt-paper.github.io/.
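The independent point-query idea described above can be illustrated with a toy sketch. This is not the D4RT implementation: the `QueryDecoder` class, the `(u, v, t)` query layout, the single-head cross-attention, the token dimensions, and the random weights are all assumptions made for illustration. It only shows why decoding each query against a shared scene encoding, with no interaction between queries, lets a caller probe any subset of points without dense per-frame decoding.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class QueryDecoder:
    """Hypothetical single-head cross-attention decoder (illustration only)."""

    def __init__(self, token_dim=32, seed=0):
        rng = np.random.default_rng(seed)
        s = 1.0 / np.sqrt(token_dim)
        # Random weights stand in for trained parameters.
        self.w_embed = rng.normal(0.0, s, (3, token_dim))   # (u, v, t) -> embedding
        self.w_q = rng.normal(0.0, s, (token_dim, token_dim))
        self.w_k = rng.normal(0.0, s, (token_dim, token_dim))
        self.w_v = rng.normal(0.0, s, (token_dim, token_dim))
        self.w_out = rng.normal(0.0, s, (token_dim, 3))     # embedding -> (x, y, z)
        self.scale = s

    def decode(self, scene_tokens, queries):
        # scene_tokens: (N, D) encoding of the whole video, computed once.
        # queries: (Q, 3) rows of assumed (u, v, t) coordinates.
        q = (queries @ self.w_embed) @ self.w_q
        k = scene_tokens @ self.w_k
        v = scene_tokens @ self.w_v
        # Each query attends only to scene tokens, never to other queries.
        attn = softmax(q @ k.T * self.scale, axis=-1)       # (Q, N)
        return (attn @ v) @ self.w_out                      # (Q, 3)


rng = np.random.default_rng(1)
scene_tokens = rng.normal(size=(64, 32))   # placeholder for encoder output
queries = rng.uniform(size=(5, 3))         # five arbitrary space-time probes

decoder = QueryDecoder()
points_batch = decoder.decode(scene_tokens, queries)
points_single = np.vstack(
    [decoder.decode(scene_tokens, q[None, :]) for q in queries]
)
```

Because queries never attend to each other in this sketch, decoding them one at a time gives the same points as decoding them together, so the number of queries can be chosen freely at inference time.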
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Camera pose estimation | Sintel 14-sequence | ATE | 0.065 | 15 |
| Camera Coordinate 3D tracking | TAPVid-3D ADT (test) | AJ | 0.307 | 9 |
| Camera Coordinate 3D tracking | TAPVid-3D PStudio (test) | AJ | 0.372 | 9 |
| Camera Coordinate 3D tracking | TAPVid-3D DriveTrack (test) | AJ | 0.257 | 9 |
| World Coordinate 3D tracking | TAPVid-3D DriveTrack (test) | APD3D | 0.47 | 7 |
| World Coordinate 3D tracking | TAPVid-3D ADT (test) | APD3D | 0.319 | 7 |
| 3D Point Cloud Reconstruction | MPI Sintel | L1 Error | 0.768 | 6 |
| 3D Point Cloud Reconstruction | ScanNet | L1 Error | 0.028 | 6 |
| Camera pose estimation | ScanNet static indoor scenes | ATE | 0.014 | 6 |
| Camera pose estimation | Re10K static indoor scenes | Pose AUC | 83.5 | 6 |