Quantized Visual Geometry Grounded Transformer

About

Learning-based 3D reconstruction models, represented by Visual Geometry Grounded Transformers (VGGTs), have made remarkable progress with the use of large-scale transformers. Yet their prohibitive computational and memory costs severely hinder real-world deployment. Post-Training Quantization (PTQ) has become a common practice for compressing and accelerating models. However, we empirically observe that PTQ faces unique obstacles when compressing billion-scale VGGTs: the data-independent special tokens induce heavy-tailed activation distributions, while the multi-view nature of 3D data makes calibration sample selection highly unstable. This paper proposes the first quantization framework for VGGTs, namely QuantVGGT. It relies mainly on two technical contributions. First, we introduce Dual-Smoothed Fine-Grained Quantization, which integrates pre-global Hadamard rotation and post-local channel smoothing to robustly mitigate heavy-tailed distributions and inter-channel variance. Second, we design Noise-Filtered Diverse Sampling, which filters outliers via deep-layer statistics and constructs frame-aware diverse calibration clusters to ensure stable quantization ranges. Comprehensive experiments demonstrate that QuantVGGT achieves state-of-the-art results across different benchmarks and bit-widths, surpassing the previous state-of-the-art generic quantization method by a large margin. We highlight that our 4-bit QuantVGGT can deliver a 3.7× memory reduction and 2.5× acceleration in real-hardware inference, while maintaining reconstruction accuracy above 98% of its full-precision counterpart. This demonstrates the advantages and practicality of QuantVGGT in resource-constrained scenarios. Our code is released at https://github.com/wlfeng0509/QuantVGGT.
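
The idea of combining a global Hadamard rotation with per-channel smoothing can be illustrated with a small NumPy sketch. This is not the authors' implementation (see the released repository for that). The function names `hadamard`, `quantize`, and `rotate_and_smooth` are hypothetical, the smoothing scale uses a SmoothQuant-style formula with an assumed `alpha` parameter, and the single scaled-up channel in the toy data is only a stand-in for the heavy-tailed activations described in the abstract.

```python
import numpy as np

def hadamard(n):
    """Orthonormal n x n Hadamard matrix via the Sylvester construction (n a power of two)."""
    assert n > 0 and n & (n - 1) == 0, "n must be a power of two"
    H = np.ones((1, 1))
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def quantize(x, bits=4, axis=None):
    """Symmetric uniform fake-quantization: snap to an integer grid, then dequantize."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(x), axis=axis, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # avoid division by zero
    return np.clip(np.round(x / scale), -qmax - 1, qmax) * scale

def rotate_and_smooth(X, W, alpha=0.5):
    """Transform activations X (tokens, d_in) and weights W (d_in, d_out)
    so that the product X @ W is unchanged but outliers are spread out."""
    H = hadamard(X.shape[1])
    Xr, Wr = X @ H, H.T @ W                  # rotation: (X H)(H^T W) = X W
    a = np.maximum(np.max(np.abs(Xr), axis=0), 1e-8)  # per-channel activation range
    w = np.maximum(np.max(np.abs(Wr), axis=1), 1e-8)  # per-channel weight range
    s = a ** alpha / w ** (1 - alpha)        # assumed SmoothQuant-style scale
    return Xr / s, Wr * s[:, None]           # (Xr / s)(s * Wr) = Xr Wr

# Toy data: one channel scaled up to mimic a heavy-tailed activation distribution.
rng = np.random.default_rng(0)
X = rng.standard_normal((128, 64))
X[:, 3] *= 100.0
W = rng.standard_normal((64, 32))
ref = X @ W

# Per-token activation scales (axis=1), per-output-channel weight scales (axis=0).
naive = quantize(X, axis=1) @ quantize(W, axis=0)
Xs, Ws = rotate_and_smooth(X, W)
smooth = quantize(Xs, axis=1) @ quantize(Ws, axis=0)

print("naive 4-bit relative error:     ", np.linalg.norm(naive - ref) / np.linalg.norm(ref))
print("rotated + smoothed relative error:", np.linalg.norm(smooth - ref) / np.linalg.norm(ref))
```

Both transforms are exact in full precision, since the rotation and the scaling cancel in the matrix product. Any benefit comes only from the transformed tensors having a narrower dynamic range before rounding, so the two printed errors are best compared empirically rather than assumed.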

Weilun Feng, Haotong Qin, Mingqiang Wu, Chuanguang Yang, Yuqi Li, Xiangqi Li, Zhulin An, Libo Huang, Yulun Zhang, Michele Magno, Yongjun Xu • 2025

Related benchmarks

Task	Dataset	Result
3D Reconstruction	7 Scenes	--	161
Camera pose estimation	CO3D v2	AUC@30 96.84	132
3D Reconstruction	Neural RGB-D (NRGBD)	Acc Mean 0.019	88
Camera pose estimation	CO3D v2 (test)	AUC@30 89.6	61
Point Cloud Reconstruction	7 Scenes	--	58
Point Map Estimation	ETH3D	NC Mean 0.842	50
Point Map Estimation	DTU	Accuracy (Mean) 1.292	42
Camera pose estimation	RE10K	AUC@30 85.16	30
Pointmap Regression	DTU	Mean Accuracy 1.182	26
Depth Estimation	KITTI Video	AbsRel 0.054	11
