Block-Sparse Global Attention for Efficient Multi-View Geometry Transformers

About

Efficient and accurate feed-forward multi-view reconstruction has long been an important task in computer vision. Recent transformer-based models like VGGT, $\pi^3$ and MapAnything have demonstrated remarkable performance with relatively simple architectures. However, their scalability is fundamentally constrained by the quadratic complexity of global attention, which imposes a significant runtime bottleneck when processing large image sets. In this work, we empirically analyze the global attention matrix of these models and observe that the probability mass concentrates on a small subset of patch-patch interactions corresponding to cross-view geometric correspondences. Building on this insight and inspired by recent advances in large language models, we propose a training-free, block-sparse replacement for dense global attention, implemented with highly optimized kernels. Our method accelerates inference by more than $3\times$ while maintaining comparable task performance. Evaluations on a comprehensive suite of multi-view benchmarks demonstrate that our approach seamlessly integrates into existing global attention-based architectures such as VGGT, $\pi^3$ , and MapAnything, while substantially improving scalability to large image collections.

Chung-Shien Brian Wang, Christian Schmidt, Jens Piekenbrinck, Bastian Leibe• 2025

Related benchmarks

Task	Dataset	Result
Video Depth Estimation	Sintel	Delta Threshold Accuracy (1.25)63.9	235
Camera pose estimation	TUM-dynamic	ATE0.0125	205
Video Depth Estimation	BONN	AbsRel5.7	131
Camera pose estimation	CO3D v2	AUC@3097.22	117
Point Map Estimation	7 Scenes	Accuracy (Mean)1.74	69
Multi-View Reconstruction	DTU	Chamfer Distance1.1908	64
Multi-View Reconstruction	CO3D v2	AUC@300.9722	64
3D Reconstruction	NRGBD	Accuracy Mean4.1	63
Relative Pose Estimation	ScanNet 1500 pairs (test)	AUC@5°35.13	56
3D Reconstruction	DTU	Chamfer Distance1.332	55

Showing 10 of 27 rows

Other info

Follow for update

@wizwand_team Discord