Matrix3D: Large Photogrammetry Model All-in-One
About
We present Matrix3D, a unified model that performs several photogrammetry subtasks, including pose estimation, depth prediction, and novel view synthesis, with a single set of weights. Matrix3D uses a multi-modal diffusion transformer (DiT) to integrate transformations across several modalities, such as images, camera parameters, and depth maps. The key to Matrix3D's large-scale multi-modal training is a mask learning strategy. It enables full-modality training even with partially complete data, such as bi-modality image-pose and image-depth pairs, and thus significantly increases the pool of available training data. Matrix3D demonstrates state-of-the-art performance in pose estimation and novel view synthesis. It also offers fine-grained control through multi-round interactions, making it an innovative tool for 3D content creation. Project page: https://nju-3dv.github.io/projects/matrix3d.
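The mask learning idea can be illustrated with a minimal sketch. This is not the authors' implementation: the role names (`condition`, `target`, `missing`), the 50% split, and the helper functions `assign_roles` and `masked_mse` are assumptions made for illustration. The sketch only shows how a sample that lacks one modality (for example, an image-depth pair without poses) can still contribute to training, because the loss is computed over the modalities chosen as prediction targets and the absent modality is masked out.

```python
import random

# Modalities named in the abstract; the role scheme below is hypothetical.
MODALITIES = ("image", "pose", "depth")


def assign_roles(available, rng):
    """Assign a role to each modality for one training sample.

    available: dict mapping modality name -> True if ground truth exists.
    Returns a dict mapping modality name -> "condition", "target", or "missing".
    """
    present = [m for m in MODALITIES if available.get(m, False)]
    if not present:
        raise ValueError("sample has no usable modality")
    # Force at least one target so every sample yields a loss term.
    forced_target = rng.choice(present)
    roles = {}
    for m in MODALITIES:
        if m not in present:
            roles[m] = "missing"      # masked out, never supervised
        elif m == forced_target or rng.random() < 0.5:
            roles[m] = "target"       # to be predicted (denoised)
        else:
            roles[m] = "condition"    # given to the model as clean input
    return roles


def masked_mse(pred, gt, roles):
    """Mean squared error averaged over target modalities only."""
    losses = []
    for m in MODALITIES:
        if roles[m] != "target":
            continue
        sq = [(p - g) ** 2 for p, g in zip(pred[m], gt[m])]
        losses.append(sum(sq) / len(sq))
    return sum(losses) / len(losses)


# An image-depth pair without pose annotations still yields a training signal.
rng = random.Random(0)
roles = assign_roles({"image": True, "pose": False, "depth": True}, rng)
```

In this sketch, `roles["pose"]` is always `"missing"` for the sample above, so no pose loss is computed, while the image and depth are split between conditioning inputs and prediction targets.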
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Depth Prediction | ETH3D | AbsRel | 19.7 | 37 |
| Novel View Synthesis | Google Scanned Objects (GSO) (test) | PSNR | 19.941 | 24 |
| Novel View Synthesis | Mip-NeRF 360 out-of-domain 3 | PSNR | 13.97 | 8 |
| 3D Reconstruction | GSO (test) | Chamfer Distance (CD) | 0.058 | 8 |
| Source View Depth Estimation | GSO (test) | Relative Error (Rel) | 8.782 | 8 |
| Novel View Synthesis | RealEstate10K 58 (test) | PSNR | 14.49 | 8 |
| Novel View Synthesis | DL3DV 27 (test) | PSNR | 13.33 | 8 |
| Novel View Depth Estimation | GSO (test) | Relative Error (Rel) | 8.897 | 5 |
| Pose Estimation | GSO (test) | RA@5 | 43.77 | 5 |
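
For readers unfamiliar with two of the metrics above, the sketch below shows their commonly used definitions: PSNR for novel view synthesis and absolute relative error (AbsRel) for depth. The exact evaluation protocol behind the table (image range, depth alignment, whether values are reported as percentages) is not stated here, so the function names, the `max_val` default, and the valid-pixel rule are assumptions.

```python
import math


def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB for two equal-length pixel sequences.

    max_val is the largest possible pixel value (assumed 1.0 here).
    """
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_val ** 2 / mse)


def abs_rel(pred_depth, gt_depth):
    """Mean of |pred - gt| / gt over pixels with positive ground-truth depth."""
    valid = [(p, g) for p, g in zip(pred_depth, gt_depth) if g > 0]
    return sum(abs(p - g) / g for p, g in valid) / len(valid)
```

Higher PSNR is better, while lower AbsRel is better.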