M2SVid: End-to-End Inpainting and Refinement for Monocular-to-Stereo Video Conversion

About

We tackle the problem of monocular-to-stereo video conversion and propose a novel architecture for inpainting and refinement of the warped right view obtained by depth-based reprojection of the input left view. We extend the Stable Video Diffusion (SVD) model to utilize the input left video, the warped right video, and the disocclusion masks as conditioning input to generate a high-quality right camera view. To effectively exploit information from neighboring frames for inpainting, we modify the attention layers in SVD to compute full attention for disoccluded pixels. Our model is trained to generate the right view video in an end-to-end manner without iterative diffusion steps by minimizing image space losses to ensure high-quality generation. Our approach outperforms previous state-of-the-art methods, being ranked best 2.6x more often than the second-place method in a user study, while being 6x faster.
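To make the terms "warped right view" and "disocclusion mask" concrete, the following is a minimal sketch of depth-based reprojection for a single frame. It is not the authors' implementation: it assumes a rectified stereo pair, a per-pixel disparity map already derived from depth, and nearest-pixel forward warping. The function name `warp_left_to_right` and its arguments are hypothetical.

```python
import numpy as np

def warp_left_to_right(left, disparity):
    """Forward-warp a left image into the right view (illustrative sketch).

    left:      (H, W, C) array, the input left frame.
    disparity: (H, W) array of non-negative disparities in pixels
               (assumed to be precomputed from a depth map).
    Returns (warped, mask); mask is True where no source pixel landed,
    i.e. at disoccluded pixels that would need inpainting.
    """
    h, w, _ = left.shape
    warped = np.zeros_like(left)
    # Z-buffer on disparity: larger disparity means a closer surface, which wins.
    best = np.full((h, w), -np.inf)
    ys, xs = np.mgrid[0:h, 0:w]
    # In a rectified pair, a left pixel at column x appears at x - d on the right.
    xt = np.rint(xs - disparity).astype(int)
    valid = (xt >= 0) & (xt < w)
    for y, x, t, d in zip(ys[valid], xs[valid], xt[valid], disparity[valid]):
        if d > best[y, t]:
            best[y, t] = d
            warped[y, t] = left[y, x]
    mask = np.isinf(best)  # target pixels never written are disoccluded
    return warped, mask
```

In this sketch a foreground object shifts further left than the background, so the pixels just to its right in the warped view receive no source pixel and are flagged in `mask`. According to the abstract, the warped view and these masks, together with the left video, are what the model takes as conditioning input.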

Nina Shvetsova, Goutam Bhat, Prune Truong, Hilde Kuehne, Federico Tombari • 2025

Related benchmarks

Task | Dataset | Result | Rank
Stereo Video Synthesis | Stereo4D Parallel Format | MS-SSIM 91.5 | 7
Stereoscopic Video Generation | Stereo4D (test) | iSQoE 0.515 | 7
Mono-to-stereo video conversion | Stereo4D (test) | PSNR 24.6 | 6
Stereoscopic Video Generation | AVP (test) | iSQoE 0.507 | 6
Stereoscopic Video Generation | iPhone (test) | iSQoE 0.505 | 6
Monocular to Binocular Stereo Video Conversion | Spatial Video dataset iPhone portion (test) | PSNR 22.9 | 5
Mono-to-stereo video conversion | Apple Vision Pro Spatial Video (out-of-distribution) | PSNR 24.4 | 5
Mono-to-stereo video conversion | Ego4D (test) | PSNR 18 | 5
3D Video Generation | iPhone and Apple Vision Pro (AVP) datasets | Equal Preference Count 20 | 4