AVGGT: Rethinking Global Attention for Accelerating VGGT

About

Models such as VGGT and $\pi^3$ have shown strong multi-view 3D performance, but their heavy reliance on global self-attention results in high computational cost. Existing sparse-attention variants offer partial speedups, yet lack a systematic analysis of how global attention contributes to multi-view reasoning. In this paper, we first conduct an in-depth investigation of the global attention modules in VGGT and $\pi^3$ to better understand their roles. Our analysis reveals a clear division of roles in the alternating global-frame architecture: early global layers do not form meaningful correspondences, middle layers perform cross-view alignment, and the last layers provide only minor refinements. Guided by these findings, we propose a training-free two-step acceleration scheme: (1) converting early global layers into frame attention, and (2) subsampling the keys and values (K/V) of global attention over patch tokens, with diagonal preservation and a mean-fill component. We instantiate this strategy on VGGT and $\pi^3$ and evaluate it across standard pose and point-map benchmarks. Our method achieves substantial inference acceleration across different context lengths, yielding about $2\times$ speedup at 100 frames, $4$--$5\times$ at 300 frames, and $8$--$10\times$ at 800 frames, while matching or slightly improving the accuracy of the original models and remaining robust in extremely dense multi-view settings where prior sparse-attention baselines fail.
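The two-step scheme described in the abstract can be illustrated with a small, pure-Python toy. This is only a sketch under stated assumptions, not the authors' implementation: the abstract does not specify how "diagonal preservation" or "mean-fill" are realised, so here they are read as (a) keeping each frame's own tokens dense in its K/V set and (b) appending one mean token per other frame to stand in for the dropped tokens. The function names, the `stride` parameter, and the use of raw tokens as Q, K, and V (no learned projections, no heads) are all hypothetical simplifications.

```python
import math

def attend(q, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]

def mean_vector(vectors):
    n = len(vectors)
    return [sum(v[j] for v in vectors) / n for j in range(len(vectors[0]))]

def frame_attention(frames):
    """Step 1 stand-in: each token attends only to tokens of its own frame."""
    return [[attend(q, frame, frame) for q in frame] for frame in frames]

def global_attention(frames):
    """Dense baseline: each token attends to every token of every frame."""
    all_tokens = [t for frame in frames for t in frame]
    return [[attend(q, all_tokens, all_tokens) for q in frame]
            for frame in frames]

def kv_for_frame(frames, f, stride):
    """Step 2 stand-in: build the reduced K/V set seen by queries of frame f."""
    kv = list(frames[f])  # assumption: "diagonal preservation" = own frame kept dense
    for g, frame in enumerate(frames):
        if g == f:
            continue
        kept = frame[::stride]  # keep every stride-th patch token
        dropped = [t for i, t in enumerate(frame) if i % stride != 0]
        kv.extend(kept)
        if dropped:
            # assumption: "mean-fill" = one mean token per frame for dropped tokens
            kv.append(mean_vector(dropped))
    return kv

def subsampled_global_attention(frames, stride):
    out = []
    for f, frame in enumerate(frames):
        kv = kv_for_frame(frames, f, stride)
        out.append([attend(q, kv, kv) for q in frame])
    return out

if __name__ == "__main__":
    # Toy input: 3 frames, 4 patch tokens per frame, 2-dimensional tokens.
    frames = [[[math.sin(f + p), math.cos(f * p + 1.0)] for p in range(4)]
              for f in range(3)]
    print(len(kv_for_frame(frames, 0, stride=2)), "K/V tokens instead of", 3 * 4)
    print(subsampled_global_attention(frames, stride=2)[0][0])
```

In this toy, a query in frame 0 with `stride=2` sees 10 K/V tokens rather than 12 (4 from its own frame, plus 2 kept tokens and 1 mean token from each of the other two frames). With `stride=1` nothing is dropped and the result coincides with dense global attention, which gives a quick sanity check for the sketch.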

Xianbing Sun, Zhikai Zhu, Zhengyu Lou, Bo Yang, Jinyang Tang, Liqing Zhang, He Wang, Jianfu Zhang • 2025

Related benchmarks

Task	Dataset	Metric	Result	(unlabeled)
3D Reconstruction	7 Scenes	Completion	14.4	161
Camera pose estimation	CO3D v2	AUC@30	89.27	132
Point Cloud Reconstruction	7 Scenes	Inference Time (s)	20.6	58
3D Reconstruction	ETH3D	F1 Score	66.5	50
Pose Estimation	HiRoom	AUC@3	49.32	47
Camera pose estimation	RealEstate10K	AUC@30	85.45	46
Camera pose estimation	ScanNet++	AUC @ 30°	92.68	32
Camera pose estimation	7Scenes	AUC@3	0.2505	32
Camera pose estimation	RE10K	AUC@30	89.06	30
Point Cloud Reconstruction	ScanNet-50 (300 frames)	Chamfer Distance (CD)	0.43	21

Showing 10 of 23 rows
