
Reconstruct, Inpaint, Test-Time Finetune: Dynamic Novel-view Synthesis from Monocular Videos

About

We explore novel-view synthesis for dynamic scenes from monocular videos. Prior approaches either rely on costly test-time optimization of 4D representations or fail to preserve scene geometry when trained in a feed-forward manner. Our approach is based on three key insights: (1) covisible pixels (those visible in both the input and target views) can be rendered by first reconstructing the dynamic 3D scene and then rendering the reconstruction from the novel view; (2) hidden pixels in novel views can be "inpainted" with feed-forward 2D video diffusion models. Notably, our video inpainting diffusion model (CogNVS) can be self-supervised from 2D videos, allowing us to train it on a large corpus of in-the-wild videos. This in turn allows (3) CogNVS to be applied zero-shot to novel test videos via test-time finetuning. We empirically verify that CogNVS outperforms almost all prior art for novel-view synthesis of dynamic scenes from monocular videos.
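The self-supervision idea in (2) can be illustrated with a toy sketch: given a video frame and a covisibility mask, a training pair for the inpainting model is formed by hiding the non-covisible pixels of the frame and using the full frame as the target. This is a minimal illustration under assumed data shapes, not the paper's actual implementation; `make_selfsup_pair` is a hypothetical helper.

```python
import numpy as np

def make_selfsup_pair(frame, covis_mask):
    """Construct a self-supervised inpainting pair.

    Input:  the frame with hidden (non-covisible) pixels zeroed out.
    Target: the original, complete frame.
    Hypothetical helper for illustration only.
    """
    inp = frame * covis_mask[..., None]  # broadcast HxW mask over RGB channels
    return inp, frame

rng = np.random.default_rng(0)
frame = rng.random((4, 4, 3))          # toy 4x4 RGB frame
covis = np.ones((4, 4), dtype=bool)
covis[1:3, 1:3] = False                # pretend these pixels are hidden in the novel view
inp, tgt = make_selfsup_pair(frame, covis)
```

An inpainting network trained to map `inp` back to `tgt` needs no novel-view ground truth, which is what lets the model be trained on ordinary in-the-wild 2D videos.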

Kaihua Chen, Tarasha Khurana, Deva Ramanan • 2025

Related benchmarks

Task | Dataset | Result | Rank
Narrow Dynamic View Synthesis | DyCheck iPhone 1.0 (test) | PSNR 16.94 | 7
Narrow Dynamic View Synthesis | Kubric-4D gradual 1.0 (test) | PSNR 22.63 | 7
Novel View Synthesis | Droid, BridgeData V2, and RoboCoin (test) | PSNR 11.88 | 7
Click Bell | RoboTwin 0° viewpoint | Success Rate 35 | 6
Click Alarmclock | RoboTwin 0° viewpoint | Success Rate 68 | 6
Narrow Dynamic View Synthesis | ParDom-4D gradual 1.0 (test) | PSNR 24.34 | 6
Generative Video Synthesis | RoboTwin | PSNR (dB) 17.154 | 5
Monocular Dynamic 3D Reconstruction | Truebones | mPSNR 21.062 | 4
Monocular Dynamic 3D Reconstruction | Panoptic Studio sports | mPSNR 16.904 | 4
