
Faster VGGT with Block-Sparse Global Attention

About

Efficient and accurate feed-forward multi-view reconstruction has long been an important task in computer vision. Recent transformer-based models like VGGT and $\pi^3$ have achieved impressive results with simple architectures, yet they face an inherent runtime bottleneck: the quadratic complexity of their global attention layers limits scalability to large image sets. In this paper, we empirically analyze the global attention matrices of these models and observe that probability mass concentrates on a small subset of patch-patch interactions that correspond to cross-view geometric matches. Motivated by this structured attention and inspired by recent advances in large language models, we propose a replacement for the dense global attention operation based on highly optimized block-sparse kernels, yielding up to $4\times$ faster inference with comparable task performance. Our retrofit requires no retraining of the backbone, extends to both VGGT and $\pi^3$, and supports large image collections. Evaluations on a comprehensive suite of multi-view benchmarks demonstrate the effectiveness of our approach.
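The core idea of replacing dense global attention with block-sparse attention can be illustrated with a small sketch. The snippet below is a hypothetical NumPy reference implementation, not the paper's optimized kernel: the function names, the block size, and the mean-pooled block-selection heuristic are all illustrative assumptions. It partitions the token sequence into blocks, cheaply estimates which query-block/key-block pairs carry attention mass, and computes exact attention only inside the kept blocks.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def block_sparse_attention(Q, K, V, block=16, keep=0.25):
    """Illustrative block-sparse attention (not the paper's kernel).

    Q, K, V: (n, d) arrays with n divisible by `block`.
    keep: fraction of key blocks retained per query block.
    """
    n, d = Q.shape
    nb = n // block
    # Cheap proxy for block-level attention mass: mean-pool each block
    # of queries/keys, then score all block pairs at once.
    Qb = Q.reshape(nb, block, d).mean(axis=1)
    Kb = K.reshape(nb, block, d).mean(axis=1)
    scores = Qb @ Kb.T                      # (nb, nb) coarse block affinity
    k = max(1, int(np.ceil(keep * nb)))
    # Indices of the k highest-scoring key blocks per query block.
    topk = np.argpartition(-scores, k - 1, axis=1)[:, :k]
    out = np.zeros_like(Q)
    for i in range(nb):
        q = Q[i * block:(i + 1) * block]    # (block, d) query block
        cols = np.concatenate(
            [np.arange(j * block, (j + 1) * block) for j in topk[i]]
        )
        # Dense attention restricted to the kept key/value blocks.
        att = softmax(q @ K[cols].T / np.sqrt(d))
        out[i * block:(i + 1) * block] = att @ V[cols]
    return out
```

With `keep=1.0` every block is retained and the result matches dense attention exactly; smaller `keep` trades a controlled approximation for fewer attended key blocks, which is what makes highly optimized sparse kernels pay off at large image counts.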

Chung-Shien Brian Wang, Christian Schmidt, Jens Piekenbrinck, Bastian Leibe • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Camera pose estimation | CO3D v2 | AUC@30 | 97.22 | 78 |
| Multi-View Reconstruction | DTU | Chamfer Distance | 1.1908 | 64 |
| Multi-View Reconstruction | CO3D v2 | AUC@30 | 0.9722 | 64 |
| Relative Pose Estimation | ScanNet 1500 pairs (test) | AUC@5° | 35.13 | 56 |
| 3D Reconstruction | DTU | Chamfer Distance | 1.332 | 55 |
| Pose Estimation | RE10K | -- | -- | 35 |
| Pose Estimation | CO3D v2 | AUC@30 | 88.25 | 19 |
| Point Map Estimation | DTU (test) | Accuracy (Mean) | 1.966 | 15 |
| Pose Estimation | Tanks & Temples long-sequence | RRA@5 | 67.85 | 10 |
| Pointmap Estimation | ETH3D 32 (test) | Accuracy (Mean) | 86.1 | 8 |
