SPFSplatV2: Efficient Self-Supervised Pose-Free 3D Gaussian Splatting from Sparse Views

About

We introduce SPFSplatV2, an efficient feed-forward framework for 3D Gaussian splatting from sparse multi-view images, requiring no ground-truth poses during training or inference. The framework employs a shared feature extraction backbone to jointly predict 3D Gaussian primitives and camera poses in a canonical space from unposed inputs. To enable efficient and accurate pose estimation, we introduce a masked attention mechanism for target-view pose prediction and a reprojection loss that enforces pixel-aligned Gaussian primitives, providing stronger geometric constraints. We further demonstrate the compatibility of our training framework with different reconstruction architectures, resulting in two model variants. Remarkably, despite the absence of pose supervision, our method achieves state-of-the-art performance in both in-domain and out-of-domain novel view synthesis, even under extreme viewpoint changes and limited image overlap. It also surpasses many methods that rely on geometric supervision in relative pose estimation. By eliminating dependence on ground-truth poses, our method offers the scalability to leverage larger and more diverse datasets. Code and pretrained models will be available on our project page: https://ranrhuang.github.io/spfsplatv2/.

Ranran Huang, Krystian Mikolajczyk • 2025
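
The abstract describes a reprojection loss that enforces pixel-aligned Gaussian primitives but gives no formula. The following is a minimal sketch of a generic reprojection error, not the authors' implementation. It assumes a pinhole camera, camera-to-world poses expressed in the canonical space, and a mean L2 pixel distance. The function names `project_points` and `reprojection_loss`, the intrinsics, and the sample values are hypothetical and chosen for illustration only.

```python
import numpy as np


def project_points(points, cam_to_world, K):
    """Project (N, 3) points from the canonical space to (N, 2) pixel coordinates.

    Assumes a pinhole camera with 3x3 intrinsics K and a 4x4 camera-to-world pose.
    """
    world_to_cam = np.linalg.inv(cam_to_world)
    ones = np.ones((points.shape[0], 1))
    points_h = np.concatenate([points, ones], axis=1)      # (N, 4) homogeneous
    points_cam = (world_to_cam @ points_h.T).T[:, :3]      # (N, 3) camera frame
    uvw = (K @ points_cam.T).T                             # (N, 3) before divide
    return uvw[:, :2] / uvw[:, 2:3]                        # perspective divide


def reprojection_loss(centers, pixels, cam_to_world, K):
    """Mean L2 distance between projected Gaussian centers and their source pixels."""
    projected = project_points(centers, cam_to_world, K)
    return float(np.mean(np.linalg.norm(projected - pixels, axis=1)))


# Hypothetical setup: three pixels with known depths, unprojected with an identity pose.
K = np.array([[100.0, 0.0, 50.0],
              [0.0, 100.0, 50.0],
              [0.0, 0.0, 1.0]])
pixels = np.array([[10.0, 20.0], [50.0, 50.0], [80.0, 30.0]])
depths = np.array([2.0, 3.0, 4.0])
pixels_h = np.concatenate([pixels, np.ones((3, 1))], axis=1)
centers = depths[:, None] * (np.linalg.inv(K) @ pixels_h.T).T  # pixel-aligned centers

identity_pose = np.eye(4)
shifted_pose = np.eye(4)
shifted_pose[0, 3] = 0.1  # camera moved 0.1 units along x

loss_aligned = reprojection_loss(centers, pixels, identity_pose, K)
loss_shifted = reprojection_loss(centers, pixels, shifted_pose, K)
```

In this sketch the loss is near zero when the pose used for projection matches the pose under which the centers were placed along their pixel rays, and it grows when the pose is wrong. This is the sense in which a reprojection term couples the predicted Gaussians and the predicted poses geometrically.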

Related benchmarks

Task	Dataset	Result
Novel View Synthesis	RE10K	SSIM 89.2	345
Novel View Synthesis	ACID	PSNR 26.802	175
Novel View Synthesis	DTU	PSNR 19.316	154
Novel View Synthesis	ScanNet++	PSNR 23.072	93
Novel View Synthesis	RE10K (Medium)	PSNR 26.03	57
Novel View Synthesis	RE10K (Average)	PSNR 26.157	57
Pose Estimation	RE10K	AUC @ 5° 0.645	41
Novel View Synthesis	RE10K Small	PSNR 23.138	38
Novel View Synthesis	RE10K (small overlap)	PSNR 23.456	32
Novel View Synthesis	RE10K large overlap	PSNR 28.682	32

Showing 10 of 31 rows
