AerialMegaDepth: Learning Aerial-Ground Reconstruction and View Synthesis
About
We explore the task of geometric reconstruction of images captured from a mixture of ground and aerial views. Current state-of-the-art learning-based approaches fail to handle the extreme viewpoint variation between aerial-ground image pairs. Our hypothesis is that the lack of high-quality, co-registered aerial-ground datasets for training is a key reason for this failure. Such data is difficult to assemble precisely because it is difficult to reconstruct in a scalable way. To overcome this challenge, we propose a scalable framework combining pseudo-synthetic renderings from 3D city-wide meshes (e.g., Google Earth) with real, ground-level crowd-sourced images (e.g., MegaDepth). The pseudo-synthetic data simulates a wide range of aerial viewpoints, while the real, crowd-sourced images help improve visual fidelity for ground-level images where mesh-based renderings lack sufficient detail, effectively bridging the domain gap between real images and pseudo-synthetic renderings. Using this hybrid dataset, we fine-tune several state-of-the-art algorithms and achieve significant improvements on real-world, zero-shot aerial-ground tasks. For example, we observe that baseline DUSt3R localizes fewer than 5% of aerial-ground pairs within 5 degrees of camera rotation error, while fine-tuning with our data raises accuracy to nearly 56%, addressing a major failure point in handling large viewpoint changes. Beyond camera estimation and scene reconstruction, our dataset also improves performance on downstream tasks like novel-view synthesis in challenging aerial-ground scenarios, demonstrating the practical value of our approach in real-world applications.
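The "within 5 degrees of camera rotation error" figure above can be read as a relative rotation accuracy (RRA) at a 5° threshold. The following is a minimal sketch of how such a metric is commonly computed, assuming each pair has a predicted and a ground-truth relative rotation given as a 3×3 matrix. The function names `rotation_error_deg`, `rra_at`, and `rot_z` are hypothetical and are not taken from the authors' code; their exact evaluation protocol may differ.

```python
import numpy as np


def rotation_error_deg(R_pred, R_gt):
    """Geodesic angle (degrees) between two 3x3 rotation matrices."""
    R_delta = R_pred @ R_gt.T
    # For a rotation by angle theta, trace(R) = 1 + 2*cos(theta).
    cos_theta = (np.trace(R_delta) - 1.0) / 2.0
    # Clip to guard against values slightly outside [-1, 1] from rounding.
    cos_theta = np.clip(cos_theta, -1.0, 1.0)
    return float(np.degrees(np.arccos(cos_theta)))


def rra_at(pred_rotations, gt_rotations, threshold_deg=5.0):
    """Fraction of pairs whose rotation error is at most threshold_deg."""
    errors = [
        rotation_error_deg(R_p, R_g)
        for R_p, R_g in zip(pred_rotations, gt_rotations)
    ]
    return float(np.mean(np.array(errors) <= threshold_deg))


def rot_z(deg):
    """Helper for the toy example: rotation about the z-axis."""
    t = np.radians(deg)
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


# Toy example (assumed data): one pair is off by 3 degrees, one by 10.
preds = [rot_z(3.0), rot_z(10.0)]
gts = [rot_z(0.0), rot_z(0.0)]
score = rra_at(preds, gts, threshold_deg=5.0)
```

In this toy example only the first pair falls under the threshold, so half of the pairs count as correctly localized.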
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Pairwise Camera Pose Estimation | Ground-Aerial | RRA @ 5° | 5.60e+3 | 10 |
| Novel View Synthesis | Google Earth (pseudo-synthetic images) | LPIPS | 0.359 | 7 |
| Camera Pose Estimation | CrossGeo | Mean Accuracy | 2.1334 | 5 |
| Cross-view Localization | CrossGeo Ground Camera 1.0 (test) | Mean Distance (m) | 22.22 | 5 |
| Ground Camera Localization | AnyVisLoc | Mean Translation Error (m) | 48.9 | 5 |
| UAV Camera Localization | AnyVisLoc | Mean Translation Error (m) | 40.17 | 5 |
| Cross-view Localization | CrossGeo UAV Camera 1.0 (test) | Mean Distance Error (m) | 12.77 | 5 |
| 3D Geometry Prediction | Ground-Aerial | Delta Error (0.5m) | 32.77 | 4 |
| Novel View Synthesis | Real-world aerial-ground pairs | DreamSim | 0.442 | 2 |