
$\text{VG}^2$GT: Voxel-Gaussian Splatting Visual Geometry Grounded Transformer

About

Gaussian splatting has shown strong potential for 3D reconstruction and novel view synthesis. However, most existing methods require accurate camera parameters and per-scene optimization, while feed-forward methods with pixel-aligned Gaussian primitives often suffer from artifacts and non-uniform primitives. In this paper, we propose $\text{VG}^2$GT, a Voxel-Gaussian Splatting Visual Geometry-Grounded Transformer. $\text{VG}^2$GT leverages a frozen pretrained visual foundation model (VFM), incorporates a multi-scale differentiable voxel module to enhance geometric understanding, and directly splits and regresses Gaussian primitive parameters from voxel features. During training, depth maps are supervised through stochastic solid volume rendering, enabling geometrically accurate Gaussian scene reconstruction while keeping the visual foundation model fully frozen. This design enables $\text{VG}^2$GT to be seamlessly plugged into any patch-feature-based VFM, while substantially reducing the required training cost. $\text{VG}^2$GT outperforms current state-of-the-art methods on the widely used DTU, Replica, TAT, and ScanNet datasets.

Yibin Zhao, Yihan Pan, Jun Nan, Wenli Yang, Liwei Chen, Jianjun Yi • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Novel View Synthesis | Tanks&Temples (test) | PSNR 22.46 | 289 |
| Novel View Synthesis | Replica (test) | PSNR 31.03 | 67 |
| GS Depth Rendering | Replica Dataset | RMSE 0.043 | 54 |
| GS Depth Rendering | DTU Dataset | RMSE 0.014 | 54 |
| GS Depth Rendering | TAT Dataset | RMSE 0.145 | 54 |
| Novel View Synthesis | TAT Dataset (test) | PSNR 22.46 | 30 |
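
For readers unfamiliar with the two metrics reported here, the following is a minimal sketch of their standard textbook definitions. It is not the authors' evaluation code: the function names `psnr` and `depth_rmse`, the peak value `max_val=1.0`, and the use of flat lists of values are assumptions made for illustration. The page does not state the image value range, the depth units, or any masking used in the actual benchmarks.

```python
import math


def mean_squared_error(pred, target):
    """Mean squared error between two equal-length sequences of numbers."""
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("inputs must be non-empty and have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB (higher is better).

    max_val is the assumed peak pixel value (1.0 for images scaled to [0, 1]).
    """
    mse = mean_squared_error(pred, target)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)


def depth_rmse(pred, target):
    """Root-mean-square error between rendered and reference depth (lower is better)."""
    return math.sqrt(mean_squared_error(pred, target))


if __name__ == "__main__":
    # Toy values only; these are not data from the benchmarks.
    rendered = [0.5, 0.5]
    reference = [0.6, 0.4]
    print("PSNR (dB):", psnr(rendered, reference))
    print("Depth RMSE:", depth_rmse([0.0, 0.0], [3.0, 4.0]))
```

Under these definitions a higher PSNR indicates a rendered view closer to the reference image, while a lower RMSE indicates rendered depth closer to the reference depth.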
