Block-Sparse Global Attention for Efficient Multi-View Geometry Transformers

About

Efficient and accurate feed-forward multi-view reconstruction has long been an important task in computer vision. Recent transformer-based models such as VGGT, $\pi^3$, and MapAnything have demonstrated remarkable performance with relatively simple architectures. However, their scalability is fundamentally constrained by the quadratic complexity of global attention, which imposes a significant runtime bottleneck when processing large image sets. In this work, we empirically analyze the global attention matrix of these models and observe that the probability mass concentrates on a small subset of patch-patch interactions corresponding to cross-view geometric correspondences. Building on this insight and inspired by recent advances in large language models, we propose a training-free, block-sparse replacement for dense global attention, implemented with highly optimized kernels. Our method accelerates inference by more than $3\times$ while maintaining comparable task performance. Evaluations on a comprehensive suite of multi-view benchmarks demonstrate that our approach integrates seamlessly into existing global attention-based architectures such as VGGT, $\pi^3$, and MapAnything, while substantially improving scalability to large image collections.
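The abstract does not say how the sparse blocks are chosen, so the following is only a minimal NumPy sketch of the general idea of block-sparse attention, not the authors' method or kernels. It assumes a single head, a token count divisible by the block size, block importance estimated from mean-pooled queries and keys, and a fixed number of key blocks kept per query block. The names `block_sparse_attention`, `block_size`, and `keep_blocks` are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)


def dense_attention(q, k, v):
    """Reference dense attention: cost grows quadratically with token count."""
    scale = 1.0 / np.sqrt(q.shape[-1])
    return softmax((q @ k.T) * scale) @ v


def block_sparse_attention(q, k, v, block_size, keep_blocks):
    """Attend only to the `keep_blocks` highest-scoring key blocks per query block.

    Assumption (not from the source): block scores come from mean-pooled
    queries and keys, followed by top-k selection.
    """
    n, d = q.shape
    assert n % block_size == 0, "token count must be divisible by block_size"
    nb = n // block_size
    scale = 1.0 / np.sqrt(d)

    # Coarse (nb x nb) score map from pooled block representatives.
    q_pool = q.reshape(nb, block_size, d).mean(axis=1)
    k_pool = k.reshape(nb, block_size, d).mean(axis=1)
    block_scores = q_pool @ k_pool.T

    # Indices of the kept key blocks for every query block.
    keep = np.argsort(-block_scores, axis=1)[:, :keep_blocks]

    out = np.empty((n, v.shape[1]))
    for i in range(nb):
        rows = slice(i * block_size, (i + 1) * block_size)
        cols = np.concatenate(
            [np.arange(j * block_size, (j + 1) * block_size) for j in keep[i]]
        )
        # Softmax is taken over the kept keys only.
        logits = (q[rows] @ k[cols].T) * scale
        out[rows] = softmax(logits) @ v[cols]
    return out, keep


rng = np.random.default_rng(0)
q = rng.standard_normal((16, 8))
k = rng.standard_normal((16, 8))
v = rng.standard_normal((16, 8))
out_sparse, kept = block_sparse_attention(q, k, v, block_size=4, keep_blocks=2)
```

In this sketch the fraction of query-key scores actually computed is `keep_blocks / nb`, and keeping every block reproduces dense attention exactly. A Python loop like this gives no real speedup; the acceleration reported in the abstract depends on optimized block-sparse kernels.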

Chung-Shien Brian Wang, Christian Schmidt, Jens Piekenbrinck, Bastian Leibe • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Video Depth Estimation | Sintel | Delta Threshold Accuracy (1.25) | 63.9 | 235 |
| Camera pose estimation | TUM-dynamic | ATE | 0.0125 | 205 |
| Video Depth Estimation | BONN | AbsRel | 5.7 | 131 |
| Camera pose estimation | CO3D v2 | AUC@30 | 97.22 | 117 |
| Point Map Estimation | 7 Scenes | Accuracy (Mean) | 1.74 | 69 |
| Multi-View Reconstruction | DTU | Chamfer Distance | 1.1908 | 64 |
| Multi-View Reconstruction | CO3D v2 | AUC@30 | 0.9722 | 64 |
| 3D Reconstruction | NRGBD | Accuracy Mean | 4.1 | 63 |
| Relative Pose Estimation | ScanNet 1500 pairs (test) | AUC@5° | 35.13 | 56 |
| 3D Reconstruction | DTU | Chamfer Distance | 1.332 | 55 |
Showing 10 of 27 rows
