VicaSplat: A Single Run is All You Need for 3D Gaussian Splatting and Camera Estimation from Unposed Video Frames

About

We present VicaSplat, a novel framework for joint 3D Gaussian reconstruction and camera pose estimation from a sequence of unposed video frames, a critical yet underexplored task in real-world 3D applications. The core of our method is a novel transformer-based network architecture. The model starts with an image encoder that maps each image to a list of visual tokens. All visual tokens are concatenated with additional learnable camera tokens, and the resulting tokens then communicate with each other within a tailored transformer decoder. The camera tokens causally aggregate features from the visual tokens of different views, and further modulate them frame by frame to inject view-dependent features. 3D Gaussian splats and camera pose parameters are then estimated by separate prediction heads. Experiments show that VicaSplat surpasses baseline methods for multi-view inputs and achieves performance comparable to prior two-view approaches. VicaSplat also demonstrates strong cross-dataset generalization on the ScanNet benchmark, achieving superior performance without any fine-tuning. Project page: https://lizhiqi49.github.io/VicaSplat.
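The token layout and the causal aggregation described above can be illustrated with a small attention-mask sketch. This is not the authors' implementation: the one-camera-token-per-frame layout, the function names `build_token_layout` and `build_attention_mask`, and the exact masking rule (camera tokens see only tokens from their own and earlier frames, visual tokens see all visual tokens) are assumptions made for illustration. The frame-wise modulation step and the prediction heads are not modeled.

```python
from typing import List, Tuple

Token = Tuple[str, int]  # (kind, frame_index), kind is "cam" or "vis"


def build_token_layout(num_frames: int, tokens_per_frame: int) -> List[Token]:
    """Hypothetical layout: one camera token followed by the frame's visual tokens."""
    layout: List[Token] = []
    for frame in range(num_frames):
        layout.append(("cam", frame))
        layout.extend(("vis", frame) for _ in range(tokens_per_frame))
    return layout


def build_attention_mask(layout: List[Token]) -> List[List[bool]]:
    """mask[q][k] is True when query position q may attend to key position k."""
    mask: List[List[bool]] = []
    for q_kind, q_frame in layout:
        row = []
        for k_kind, k_frame in layout:
            if q_kind == "cam":
                # Assumed causal rule: a camera token aggregates only from
                # tokens of its own frame and of earlier frames.
                allowed = k_frame <= q_frame
            else:
                # Assumed rule: visual tokens attend to every visual token.
                allowed = k_kind == "vis"
            row.append(allowed)
        mask.append(row)
    return mask


# Three frames with two visual tokens each.
layout = build_token_layout(num_frames=3, tokens_per_frame=2)
mask = build_attention_mask(layout)
for (kind, frame), row in zip(layout, mask):
    print(f"{kind}{frame}", "".join("x" if allowed else "." for allowed in row))
```

In the printed grid, each camera-token row widens by one frame at a time, which is the causal pattern, while the visual-token rows are identical across frames.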

Zhiqi Li, Chengrui Dong, Yiming Chen, Zhangchi Huang, Peidong Liu • 2025

Related benchmarks

Task	Dataset	Result
Novel View Synthesis	ScanNet	PSNR 24.54	132
Novel View Synthesis	ACID (test)	PSNR 22.57	113
Novel View Synthesis	RE10K 8 views	PSNR 24.502	22
Camera Pose Prediction	ScanNet (test)	ATE 0.075	18
Novel View Synthesis	ScanNet 8 views	PSNR 23.656	17
Novel View Synthesis	RE10K 4 views	PSNR 24.65	15
Novel View Synthesis	ScanNet 4 views	PSNR 26.673	15
Novel View Synthesis	RE10K 2-views setup	PSNR 25.038	10
Novel View Synthesis	RE10K 16 views	LPIPS 0.384	7
Novel View Synthesis	RE10K 24 views	LPIPS 0.443	7

Showing 10 of 15 rows
