Block-Sparse Global Attention for Efficient Multi-View Geometry Transformers
About
Efficient and accurate feed-forward multi-view reconstruction has long been an important task in computer vision. Recent transformer-based models like VGGT, $\pi^3$, and MapAnything have demonstrated remarkable performance with relatively simple architectures. However, their scalability is fundamentally constrained by the quadratic complexity of global attention, which imposes a significant runtime bottleneck when processing large image sets. In this work, we empirically analyze the global attention matrix of these models and observe that the probability mass concentrates on a small subset of patch-patch interactions corresponding to cross-view geometric correspondences. Building on this insight and inspired by recent advances in large language models, we propose a training-free, block-sparse replacement for dense global attention, implemented with highly optimized kernels. Our method accelerates inference by more than $3\times$ while maintaining comparable task performance. Evaluations on a comprehensive suite of multi-view benchmarks demonstrate that our approach seamlessly integrates into existing global attention-based architectures such as VGGT, $\pi^3$, and MapAnything, while substantially improving scalability to large image collections.
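To make the idea of block-sparse attention concrete, the following is a minimal NumPy sketch, not the authors' implementation or their optimized kernels. It assumes a single attention head, mean-pooled block summaries for estimating which key blocks matter, and a fixed top-k selection per query block. The function names and the `block` and `keep` parameters are hypothetical, and the selection rule actually used in the paper may differ.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def dense_attention(q, k, v):
    """Reference dense attention: every query attends to every key."""
    scale = 1.0 / np.sqrt(q.shape[-1])
    return softmax(q @ k.T * scale) @ v


def block_sparse_attention(q, k, v, block=4, keep=2):
    """Illustrative block-sparse attention (assumed selection rule).

    q, k: (n, d) arrays; v: (n, dv) array; n must be divisible by `block`.
    Each query block attends only to its `keep` highest-scoring key blocks,
    where block scores are estimated from mean-pooled queries and keys.
    """
    n, d = q.shape
    assert n % block == 0, "sequence length must be a multiple of the block size"
    nb = n // block
    keep = min(keep, nb)
    scale = 1.0 / np.sqrt(d)

    # Cheap block-level estimate of the attention pattern: (nb, nb) scores.
    q_pooled = q.reshape(nb, block, d).mean(axis=1)
    k_pooled = k.reshape(nb, block, d).mean(axis=1)
    block_scores = q_pooled @ k_pooled.T * scale

    # Indices of the `keep` best key blocks for each query block.
    top_blocks = np.argsort(-block_scores, axis=1)[:, :keep]

    out = np.empty((n, v.shape[1]), dtype=np.result_type(q, k, v))
    for i in range(nb):
        rows = slice(i * block, (i + 1) * block)
        # Gather the token indices belonging to the selected key blocks.
        cols = np.concatenate(
            [np.arange(j * block, (j + 1) * block) for j in sorted(top_blocks[i])]
        )
        scores = q[rows] @ k[cols].T * scale
        # Softmax is taken over the selected keys only.
        out[rows] = softmax(scores) @ v[cols]
    return out
```

With `keep` equal to the number of blocks, this sketch reduces to dense attention. With a smaller `keep`, each query block scores only `keep * block` keys instead of all `n`, which is where the savings over the quadratic dense computation would come from. Real speedups depend on fused GPU kernels rather than a Python loop like this one.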
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Video Depth Estimation | Sintel | Delta Threshold Accuracy (1.25) | 63.9 | 235 |
| Camera pose estimation | TUM-dynamic | ATE | 0.0125 | 205 |
| Video Depth Estimation | BONN | AbsRel | 5.7 | 131 |
| Camera pose estimation | CO3D v2 | AUC@30 | 97.22 | 117 |
| Point Map Estimation | 7 Scenes | Accuracy (Mean) | 1.74 | 69 |
| Multi-View Reconstruction | DTU | Chamfer Distance | 1.1908 | 64 |
| Multi-View Reconstruction | CO3D v2 | AUC@30 | 0.9722 | 64 |
| 3D Reconstruction | NRGBD | Accuracy Mean | 4.1 | 63 |
| Relative Pose Estimation | ScanNet 1500 pairs (test) | AUC@5° | 35.13 | 56 |
| 3D Reconstruction | DTU | Chamfer Distance | 1.332 | 55 |