
Reconstruct, Inpaint, Test-Time Finetune: Dynamic Novel-view Synthesis from Monocular Videos

About

We explore novel-view synthesis for dynamic scenes from monocular videos. Prior approaches either rely on costly test-time optimization of 4D representations or fail to preserve scene geometry when trained in a feed-forward manner. Our approach is based on three key insights: (1) covisible pixels (those visible in both the input and target views) can be rendered by first reconstructing the dynamic 3D scene and then rendering the reconstruction from the novel view; (2) hidden pixels in novel views can be "inpainted" with feed-forward 2D video diffusion models. Notably, our video inpainting diffusion model (CogNVS) can be self-supervised from 2D videos, allowing us to train it on a large corpus of in-the-wild videos. This in turn allows (3) CogNVS to be applied zero-shot to novel test videos via test-time finetuning. We empirically verify that CogNVS outperforms almost all prior art for novel-view synthesis of dynamic scenes from monocular videos.
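The self-supervision idea in (2) can be illustrated with a toy sketch: given a video frame and a covisibility mask, a training pair for the inpainting model is formed by hiding the non-covisible pixels of the frame and using the full frame as the target. This is a minimal illustration under assumed data shapes, not the paper's actual implementation; `make_selfsup_pair` is a hypothetical helper.

```python
import numpy as np

def make_selfsup_pair(frame, covis_mask):
    """Construct a self-supervised inpainting pair.

    Input:  the frame with hidden (non-covisible) pixels zeroed out.
    Target: the original, complete frame.
    Hypothetical helper for illustration only.
    """
    inp = frame * covis_mask[..., None]  # broadcast HxW mask over RGB channels
    return inp, frame

rng = np.random.default_rng(0)
frame = rng.random((4, 4, 3))          # toy 4x4 RGB frame
covis = np.ones((4, 4), dtype=bool)
covis[1:3, 1:3] = False                # pretend these pixels are hidden in the novel view
inp, tgt = make_selfsup_pair(frame, covis)
```

An inpainting network trained to map `inp` back to `tgt` needs no novel-view ground truth, which is what lets the model be trained on ordinary in-the-wild 2D videos.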

Kaihua Chen, Tarasha Khurana, Deva Ramanan • 2025

Related benchmarks

Task | Dataset | Result | Rank
Narrow Dynamic View Synthesis | DyCheck iPhone 1.0 (test) | PSNR 16.94 | 7
Narrow Dynamic View Synthesis | Kubric-4D gradual 1.0 (test) | PSNR 22.63 | 7
Novel View Synthesis | Droid, BridgeData V2, and RoboCoin (test) | PSNR 11.88 | 7
Click Bell | RoboTwin 0° viewpoint | Success Rate 35 | 6
Click Alarmclock | RoboTwin 0° viewpoint | Success Rate 68 | 6
Narrow Dynamic View Synthesis | ParDom-4D gradual 1.0 (test) | PSNR 24.34 | 6
Generative Video Synthesis | RoboTwin | PSNR (dB) 17.154 | 5
Monocular Dynamic 3D Reconstruction | Truebones | mPSNR 21.062 | 4
Monocular Dynamic 3D Reconstruction | Panoptic Studio sports | mPSNR 16.904 | 4
