Stable Virtual Camera: Generative View Synthesis with Diffusion Models

About

We present Stable Virtual Camera (Seva), a generalist diffusion model that creates novel views of a scene, given any number of input views and target cameras. Existing works struggle to generate either large viewpoint changes or temporally smooth samples, while relying on specific task configurations. Our approach overcomes these limitations through a simple model design, an optimized training recipe, and a flexible sampling strategy that generalize across view synthesis tasks at test time. As a result, our samples maintain high consistency without requiring additional 3D representation-based distillation, thus streamlining view synthesis in the wild. Furthermore, we show that our method can generate high-quality videos lasting up to half a minute with seamless loop closure. Extensive benchmarking demonstrates that Seva outperforms existing methods across different datasets and settings. Project page with code and model: https://stable-virtual-camera.github.io/.

Jensen Zhou, Hang Gao, Vikram Voleti, Aaryaman Vasishta, Chun-Han Yao, Mark Boss, Philip Torr, Christian Rupprecht, Varun Jampani • 2025

Related benchmarks

Task	Dataset	Result
Novel View Synthesis	Tanks&Temples (test)	--	289
Monocular Depth Estimation	NYU V2	Delta 1 Acc 57.4	174
Novel View Synthesis	RE10K	SSIM 77.7	161
Novel View Synthesis	LLFF (test)	PSNR 15.6	96
Novel View Synthesis	ScanNet++	PSNR 11.71	74
Monocular Depth Estimation	BONN	Delta 1.25 Accuracy 61.8	60
Novel View Synthesis	T&T small-viewpoint set (O)	PSNR 18.85	44
Novel View Synthesis	RE10K Small	PSNR 14.51	38
Novel View Synthesis	DL3DV 6view	PSNR 17.98	34
New View Synthesis	T&T	LPIPS 0.238	33

Showing 10 of 152 rows
