Monocular Dynamic View Synthesis: A Reality Check
About
We study recent progress in dynamic view synthesis (DVS) from monocular video. Though existing approaches have demonstrated impressive results, we show a discrepancy between the practical capture process and the existing experimental protocols, which effectively leak multi-view signals into training. We define effective multi-view factors (EMFs) to quantify the amount of multi-view signal present in the input capture sequence based on the relative camera-scene motion. We introduce two new metrics, co-visibility masked image metrics and correspondence accuracy, which overcome the issues in existing protocols. We also propose a new iPhone dataset that includes more diverse real-life deformation sequences. Using our proposed experimental protocol, we show that state-of-the-art approaches suffer a 1-2 dB drop in masked PSNR in the absence of multi-view cues and a 4-5 dB drop when modeling complex motion. Code and data can be found at https://hangg7.com/dycheck.
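To make the idea of a masked image metric concrete, here is a minimal sketch of PSNR restricted to a set of selected pixels. It is an illustration only, not the DyCheck implementation: the function name `masked_psnr`, the flat per-pixel data layout, and the assumption that a binary co-visibility mask has already been computed are all choices made for this example.

```python
import math


def masked_psnr(pred, target, mask, max_value=1.0):
    """PSNR over the pixels where `mask` is nonzero (illustrative sketch).

    pred, target: equal-length sequences of pixels; each pixel is a
        sequence of channel values in [0, max_value].
    mask: sequence of 0/1 flags, one per pixel. It is assumed to be a
        precomputed co-visibility mask; how it is built is out of scope here.
    """
    if not (len(pred) == len(target) == len(mask)):
        raise ValueError("pred, target and mask must have the same length")

    squared_error = 0.0
    count = 0
    for p, t, m in zip(pred, target, mask):
        if not m:
            continue  # pixel is excluded from the metric
        for pc, tc in zip(p, t):
            squared_error += (pc - tc) ** 2
            count += 1

    if count == 0:
        raise ValueError("mask selects no pixels")

    mse = squared_error / count
    if mse == 0.0:
        return math.inf  # identical on all selected pixels
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy example: the second pixel is badly wrong but masked out.
pred = [(0.5, 0.5, 0.5), (1.0, 1.0, 1.0)]
target = [(0.4, 0.4, 0.4), (0.0, 0.0, 0.0)]
psnr_masked = masked_psnr(pred, target, mask=[1, 0])
psnr_full = masked_psnr(pred, target, mask=[1, 1])
```

In this toy case the masked score is approximately 20 dB, while the unmasked score is far lower because the excluded pixel dominates the error. Pixels outside the mask never contribute to a masked metric, so a method is neither rewarded nor penalized for regions the mask excludes.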
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Novel View Synthesis | iPhone DyCheck 7 scenes 2x resolution | mPSNR | 16.96 | 31 |
| 4D Reconstruction | DyCheck (test) | mPSNR | 16.96 | 21 |
| Novel View Synthesis | DyCheck (test) | mPSNR | 16.96 | 15 |
| Novel View Synthesis | Nvidia Dataset | PSNR | 23.241 | 15 |
| Novel View Synthesis | iPhone dataset (test) | Mean CLIP-I | 86.04 | 13 |
| Dynamic View Synthesis | DyCheck 5 scenes, 1x resolution 1.0 (test) | mLPIPS | 0.55 | 11 |
| Novel View Synthesis | DyCheck 1.0 (novel view) | PSNR | 15.6 | 9 |
| Novel View Synthesis | iPhone dataset Block | CLIP Image Similarity | 0.8873 | 7 |
| Novel View Synthesis | iPhone (Apple) | CLIP-I | 0.8275 | 7 |
| Novel View Synthesis | iPhone dataset Mean | CLIP-I | 86.04 | 7 |