PanoVGGT: Feed-Forward 3D Reconstruction from Panoramic Imagery
About
Panoramic imagery offers a full 360° field of view and is increasingly common in consumer devices. However, it introduces non-pinhole distortions that challenge joint pose estimation and 3D reconstruction, and existing feed-forward models, built for perspective cameras, generalize poorly to this setting. We propose PanoVGGT, a permutation-equivariant Transformer framework that jointly predicts camera poses, depth maps, and 3D point clouds from one or more panoramas in a single forward pass. The model incorporates spherical-aware positional embeddings and a panorama-specific three-axis SO(3) rotation augmentation, enabling effective geometric reasoning in the spherical domain. To resolve the inherent global-frame ambiguity, we further introduce a stochastic anchoring strategy during training. In addition, we contribute PanoCity, a large-scale outdoor panoramic dataset with dense depth and 6-DoF pose annotations. Extensive experiments on PanoCity and standard benchmarks demonstrate that PanoVGGT achieves competitive accuracy, strong robustness, and improved cross-domain generalization. Code and dataset will be released.
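The three-axis SO(3) rotation augmentation can be illustrated with a short sketch: an equirectangular panorama is resampled under a full 3D rotation (yaw, pitch, roll) by mapping each output pixel to a unit direction on the sphere, inverse-rotating it, and looking up the source pixel. This is a minimal NumPy sketch of the general idea, not the authors' implementation; the function names and the nearest-neighbour lookup are our own simplifications (a real pipeline would use bilinear sampling).

```python
import numpy as np

def rotation_matrix(yaw, pitch, roll):
    # Compose a three-axis rotation R = Rz(yaw) @ Ry(pitch) @ Rx(roll).
    cy, sy = np.cos(yaw), np.sin(yaw)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cr, sr = np.cos(roll), np.sin(roll)
    Rz = np.array([[cy, -sy, 0], [sy, cy, 0], [0, 0, 1]])
    Ry = np.array([[cp, 0, sp], [0, 1, 0], [-sp, 0, cp]])
    Rx = np.array([[1, 0, 0], [0, cr, -sr], [0, sr, cr]])
    return Rz @ Ry @ Rx

def rotate_equirectangular(pano, R):
    """Resample an H x W x C equirectangular panorama under rotation R."""
    H, W = pano.shape[:2]
    # Output pixel grid -> spherical angles (pixel centers).
    v, u = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    lon = (u + 0.5) / W * 2 * np.pi - np.pi        # longitude in [-pi, pi)
    lat = np.pi / 2 - (v + 0.5) / H * np.pi        # latitude in (pi/2, -pi/2)
    # Angles -> unit direction vectors on the sphere.
    dirs = np.stack([np.cos(lat) * np.cos(lon),
                     np.cos(lat) * np.sin(lon),
                     np.sin(lat)], axis=-1)        # H x W x 3
    # Inverse-rotate each direction to find its source location
    # (d @ R applies R.T to each row vector, i.e. R^{-1} for orthogonal R).
    src = dirs @ R
    src_lon = np.arctan2(src[..., 1], src[..., 0])
    src_lat = np.arcsin(np.clip(src[..., 2], -1.0, 1.0))
    su = ((src_lon + np.pi) / (2 * np.pi) * W - 0.5) % W
    sv = (np.pi / 2 - src_lat) / np.pi * H - 0.5
    # Nearest-neighbour lookup; longitude wraps, latitude clamps at the poles.
    su = np.round(su).astype(int) % W
    sv = np.clip(np.round(sv).astype(int), 0, H - 1)
    return pano[sv, su]
```

A pure yaw rotation by one column's worth of longitude (2π/W) simply shifts the panorama horizontally, which makes the resampling easy to sanity-check; pitch and roll exercise the genuinely spherical part of the warp.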
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Depth Estimation | Matterport3D | δ1 | 92.66 | 50 |
| Depth Estimation | Stanford2D3D | Abs Rel | 0.0711 | 27 |
| Depth Estimation | Pano3D GibsonV2 | Abs Rel | 0.0833 | 24 |
| Depth Estimation | PanoCity Outdoor | Abs Rel | 0.0196 | 12 |
| Depth Estimation | Structured3D Indoor | Abs Rel | 4 | 12 |
| Camera Pose Estimation | Matterport3D Indoor | AUC@30 | 45.9 | 5 |
| Camera Pose Estimation | Stanford2D3D Indoor | AUC@30 | 55.6 | 5 |
| Camera Pose Estimation | PanoCity Outdoor | AUC@30 | 0.949 | 5 |
| Point Cloud Reconstruction | PanoCity | Accuracy Mean | 0.768 | 5 |
| Point Cloud Reconstruction | Stanford2D3D v1 (test) | Accuracy Mean | 21.09 | 5 |