PanoVGGT: Feed-Forward 3D Reconstruction from Panoramic Imagery
About
Panoramic imagery offers a full 360° field of view and is increasingly common in consumer devices. However, it introduces non-pinhole distortions that challenge joint pose estimation and 3D reconstruction, and existing feed-forward models, built for perspective cameras, generalize poorly to this setting. We propose PanoVGGT, a permutation-equivariant Transformer framework that jointly predicts camera poses, depth maps, and 3D point clouds from one or more panoramas in a single forward pass. The model incorporates spherical-aware positional embeddings and a panorama-specific three-axis SO(3) rotation augmentation, enabling effective geometric reasoning in the spherical domain. To resolve the inherent global-frame ambiguity, we further introduce a stochastic anchoring strategy during training. In addition, we contribute PanoCity, a large-scale outdoor panoramic dataset with dense depth and 6-DoF pose annotations. Extensive experiments on PanoCity and standard benchmarks demonstrate that PanoVGGT achieves competitive accuracy, strong robustness, and improved cross-domain generalization. Code and dataset will be released.
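The three-axis SO(3) rotation augmentation can be illustrated with a short sketch: an equirectangular panorama is resampled under a full 3D rotation (yaw, pitch, roll) by mapping each output pixel to a unit direction on the sphere, inverse-rotating it, and looking up the source pixel. This is a minimal NumPy sketch of the general idea, not the authors' implementation; the function names and the nearest-neighbour lookup are our own simplifications (a real pipeline would use bilinear sampling).

```python
import numpy as np

def rotation_matrix(yaw, pitch, roll):
    # Compose a three-axis rotation R = Rz(yaw) @ Ry(pitch) @ Rx(roll).
    cy, sy = np.cos(yaw), np.sin(yaw)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cr, sr = np.cos(roll), np.sin(roll)
    Rz = np.array([[cy, -sy, 0], [sy, cy, 0], [0, 0, 1]])
    Ry = np.array([[cp, 0, sp], [0, 1, 0], [-sp, 0, cp]])
    Rx = np.array([[1, 0, 0], [0, cr, -sr], [0, sr, cr]])
    return Rz @ Ry @ Rx

def rotate_equirectangular(pano, R):
    """Resample an H x W x C equirectangular panorama under rotation R."""
    H, W = pano.shape[:2]
    # Output pixel grid -> spherical angles (pixel centers).
    v, u = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    lon = (u + 0.5) / W * 2 * np.pi - np.pi        # longitude in [-pi, pi)
    lat = np.pi / 2 - (v + 0.5) / H * np.pi        # latitude in (pi/2, -pi/2)
    # Angles -> unit direction vectors on the sphere.
    dirs = np.stack([np.cos(lat) * np.cos(lon),
                     np.cos(lat) * np.sin(lon),
                     np.sin(lat)], axis=-1)        # H x W x 3
    # Inverse-rotate each direction to find its source location
    # (d @ R applies R.T to each row vector, i.e. R^{-1} for orthogonal R).
    src = dirs @ R
    src_lon = np.arctan2(src[..., 1], src[..., 0])
    src_lat = np.arcsin(np.clip(src[..., 2], -1.0, 1.0))
    su = ((src_lon + np.pi) / (2 * np.pi) * W - 0.5) % W
    sv = (np.pi / 2 - src_lat) / np.pi * H - 0.5
    # Nearest-neighbour lookup; longitude wraps, latitude clamps at the poles.
    su = np.round(su).astype(int) % W
    sv = np.clip(np.round(sv).astype(int), 0, H - 1)
    return pano[sv, su]
```

A pure yaw rotation by one column's worth of longitude (2π/W) simply shifts the panorama horizontally, which makes the resampling easy to sanity-check; pitch and roll exercise the genuinely spherical part of the warp.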
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Depth Estimation | Matterport3D | δ1 | 92.66 | 50 |
| Depth Estimation | Stanford2D3D | Abs Rel | 0.0711 | 27 |
| Depth Estimation | Pano3D GibsonV2 | Abs Rel | 0.0833 | 24 |
| Depth Estimation | PanoCity Outdoor | Abs Rel | 0.0196 | 12 |
| Depth Estimation | Structured3D Indoor | Abs Rel | 4 | 12 |
| Camera Pose Estimation | Matterport3D Indoor | AUC@30 | 45.9 | 5 |
| Camera Pose Estimation | Stanford2D3D Indoor | AUC@30 | 55.6 | 5 |
| Camera Pose Estimation | PanoCity Outdoor | AUC@30 | 0.949 | 5 |
| Point Cloud Reconstruction | PanoCity | Accuracy Mean | 0.768 | 5 |
| Point Cloud Reconstruction | Stanford2D3D v1 (test) | Accuracy Mean | 21.09 | 5 |