LiteVGGT: Boosting Vanilla VGGT via Geometry-aware Cached Token Merging

About

3D vision foundation models such as the Visual Geometry Grounded Transformer (VGGT) have greatly advanced geometric perception. However, VGGT becomes slow and memory-intensive on long input sequences, which limits its use in large-scale scenes beyond a few hundred images. To address this, we propose LiteVGGT, which achieves up to a 10x speedup and substantial memory reduction, enabling efficient processing of 1000-image scenes. We derive two key insights for 3D reconstruction: (1) tokens from local image regions are inherently geometrically correlated, which makes them highly similar and computationally redundant; (2) token similarity remains stable across adjacent network layers, so merge decisions can be reused. Guided by these insights, we design a simple yet efficient strategy, dubbed geometry-aware cached token merging. We analyze the geometric importance of each token and use it to optimize anchor token selection, better preserving the information that matters for reconstruction. We also cache merge indices and reuse them across layers, substantially reducing latency with minimal accuracy impact. This strategy retains VGGT's core performance and enables efficient fine-tuning and FP8 quantization for further gains. Extensive experiments validate LiteVGGT's effectiveness, scalability, and robustness. Project page: https://garlicba.github.io/LiteVGGT/
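
The two ideas in the abstract (merging redundant tokens into anchors, and reusing merge decisions across layers) can be illustrated with a minimal pure-Python sketch. This is not LiteVGGT's implementation. The per-token `importance` scores are assumed to be given (the paper derives them from a geometric analysis not detailed here). The anchor rule (keep the top-k tokens by importance) and the merge rule (average each remaining token into its most cosine-similar anchor) are illustrative choices. `compute_merge_plan`, `merge`, `unmerge`, and `run_layers` are hypothetical names.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two token vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def compute_merge_plan(tokens: Sequence[Vector], importance: Sequence[float],
                       num_anchors: int) -> Tuple[List[int], List[int]]:
    """Pick anchors and assign every token to an anchor slot.

    Hypothetical rule: the `num_anchors` most important tokens become anchors;
    every other token joins the anchor it is most cosine-similar to.
    Returns (anchor_indices, assignment), where assignment[i] is the slot
    (position in anchor_indices) that token i is merged into.
    """
    if not 1 <= num_anchors <= len(tokens):
        raise ValueError("num_anchors must be between 1 and len(tokens)")
    order = sorted(range(len(tokens)), key=lambda i: importance[i], reverse=True)
    anchors = sorted(order[:num_anchors])
    slot_of = {idx: s for s, idx in enumerate(anchors)}
    assignment = []
    for i, tok in enumerate(tokens):
        if i in slot_of:
            assignment.append(slot_of[i])  # an anchor always keeps its own slot
        else:
            sims = [cosine(tok, tokens[a]) for a in anchors]
            assignment.append(max(range(len(anchors)), key=lambda s: sims[s]))
    return anchors, assignment


def merge(tokens: Sequence[Vector], assignment: Sequence[int],
          num_slots: int) -> List[Vector]:
    """Average all tokens that share a slot, giving num_slots merged tokens."""
    dim = len(tokens[0])
    sums = [[0.0] * dim for _ in range(num_slots)]
    counts = [0] * num_slots
    for tok, s in zip(tokens, assignment):
        counts[s] += 1
        for d in range(dim):
            sums[s][d] += tok[d]
    # Every slot contains at least its anchor, so counts[s] >= 1.
    return [[v / counts[s] for v in sums[s]] for s in range(num_slots)]


def unmerge(merged: Sequence[Vector], assignment: Sequence[int]) -> List[Vector]:
    """Copy each merged token back to every original position in its slot."""
    return [list(merged[s]) for s in assignment]


def run_layers(tokens: Sequence[Vector], importance: Sequence[float],
               num_anchors: int,
               layers: Sequence[Callable[[List[Vector]], List[Vector]]]) -> List[Vector]:
    """Apply `layers` (callables on merged tokens) with ONE cached merge plan.

    The plan is computed once from the input tokens and reused at every layer,
    so the similarity search is not repeated per layer.
    """
    anchors, assignment = compute_merge_plan(tokens, importance, num_anchors)
    x = [list(t) for t in tokens]
    for layer in layers:
        merged = merge(x, assignment, len(anchors))
        x = unmerge(layer(merged), assignment)
    return x


tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
importance = [0.9, 0.1, 0.8, 0.2]
anchors, assignment = compute_merge_plan(tokens, importance, num_anchors=2)
out = run_layers(tokens, importance, 2, [lambda m: m, lambda m: m])
```

With these toy inputs, tokens 0 and 2 become anchors, and the near-duplicate tokens 1 and 3 are folded into them, so each identity "layer" processes 2 tokens instead of 4. Caching the plan trades per-layer precision for speed: later layers inherit merge decisions computed once, which is acceptable when token similarity is stable across adjacent layers.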

Zhijian Shu, Cheng Lin, Tao Xie, Wei Yin, Ben Li, Zhiyuan Pu, Weize Li, Yao Yao, Xun Cao, Xiaoyang Guo, Xiao-Xiao Long• 2025

Related benchmarks

Task	Dataset	Metric	Result
Camera pose estimation	TUM-dynamic	ATE	0.0145	205
3D Reconstruction	7 Scenes	--	--	128
Camera pose estimation	CO3D v2	AUC@30	83.2	117
3D Reconstruction	Neural RGB-D (NRGBD)	Acc Mean	0.031	88
Point Map Estimation	7 Scenes	Accuracy (Mean)	1.85	69
Surface Reconstruction	Tanks&Temples	Mean	0.57	57
Point Cloud Reconstruction	7 Scenes	Inference Time (s)	4.5	46
Point Map Estimation	NRGBD	Mean Accuracy	0.0264	32
Camera pose estimation	7 Scenes	ATE	0.0798	14
Camera pose estimation	Neural RGB-D	ATE	0.0531	14

Showing 10 of 16 rows
