VicaSplat: A Single Run is All You Need for 3D Gaussian Splatting and Camera Estimation from Unposed Video Frames
About
We present VicaSplat, a novel framework for joint 3D Gaussian reconstruction and camera pose estimation from a sequence of unposed video frames, a critical yet underexplored task in real-world 3D applications. The core of our method is a novel transformer-based network architecture. The model starts with an image encoder that maps each image to a sequence of visual tokens. The visual tokens from all frames are concatenated with additional learnable camera tokens, and the resulting tokens fully communicate with each other within a tailored transformer decoder. The camera tokens causally aggregate features from the visual tokens of different views, then modulate those visual tokens frame by frame to inject view-dependent features. 3D Gaussian splats and camera pose parameters are then estimated by separate prediction heads. Experiments show that VicaSplat surpasses baseline methods for multi-view inputs and achieves performance comparable to prior two-view approaches. Remarkably, VicaSplat also demonstrates exceptional cross-dataset generalization on the ScanNet benchmark, achieving superior performance without any fine-tuning. Project page: https://lizhiqi49.github.io/VicaSplat.
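To make the token arrangement described above more concrete, the following is a minimal sketch of one possible reading of it, not the authors' implementation. It assumes one camera token per frame placed directly before that frame's visual tokens, and it reads "causally aggregate" as "a camera token attends only to tokens from its own and earlier frames", while visual tokens attend to everything. The function names `build_token_layout` and `build_attention_mask`, the interleaved layout, and the exact masking rule are all hypothetical choices made for illustration.

```python
from typing import List, Tuple

Token = Tuple[str, int]  # (kind, frame_index), kind is "cam" or "vis"


def build_token_layout(num_frames: int, tokens_per_frame: int) -> List[Token]:
    """Lay out the sequence as [cam_0, vis_0..., cam_1, vis_1..., ...].

    Assumption: one camera token per frame, placed before its visual tokens.
    """
    layout: List[Token] = []
    for frame in range(num_frames):
        layout.append(("cam", frame))
        layout.extend(("vis", frame) for _ in range(tokens_per_frame))
    return layout


def build_attention_mask(layout: List[Token]) -> List[List[bool]]:
    """Return mask[q][k], True when query token q may attend to key token k."""
    n = len(layout)
    mask = [[False] * n for _ in range(n)]
    for q, (q_kind, q_frame) in enumerate(layout):
        for k, (_k_kind, k_frame) in enumerate(layout):
            if q_kind == "cam":
                # Assumed causal rule: a camera token only sees tokens
                # belonging to its own frame or to earlier frames.
                mask[q][k] = k_frame <= q_frame
            else:
                # Assumed: visual tokens communicate with all tokens.
                mask[q][k] = True
    return mask


# Example: 3 frames with 2 visual tokens each gives 9 tokens in total.
layout = build_token_layout(num_frames=3, tokens_per_frame=2)
mask = build_attention_mask(layout)
```

Under these assumptions, the camera token of the last frame can see the whole sequence, while the camera token of the first frame sees only the first frame's tokens. A boolean mask of this shape could then be handed to an attention layer, and the per-frame modulation and prediction heads would operate on the decoder output.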
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Novel View Synthesis | ScanNet | PSNR | 24.54 | 130 |
| Novel View Synthesis | ACID (test) | PSNR | 22.57 | 39 |
| Novel View Synthesis | RE10K 8 views | PSNR | 24.502 | 22 |
| Camera Pose Prediction | ScanNet (test) | ATE | 0.075 | 18 |
| Novel View Synthesis | ScanNet 8 views | PSNR | 23.656 | 17 |
| Novel View Synthesis | RE10K 4 views | PSNR | 24.65 | 15 |
| Novel View Synthesis | ScanNet 4 views | PSNR | 26.673 | 15 |
| Novel View Synthesis | RE10K 16 views | LPIPS | 0.384 | 7 |
| Novel View Synthesis | RE10K 24 views | LPIPS | 0.443 | 7 |
| Novel View Synthesis | RE10k 2-views setup | PSNR | 25.038 | 6 |