MVDiffusion: Enabling Holistic Multi-view Image Generation with Correspondence-Aware Diffusion
About
This paper introduces MVDiffusion, a simple yet effective method for generating consistent multi-view images from text prompts, given pixel-to-pixel correspondences (e.g., perspective crops from a panorama, or multi-view images given depth maps and poses). Unlike prior methods that rely on iterative image warping and inpainting, MVDiffusion generates all images simultaneously with global awareness, effectively addressing the prevalent error-accumulation issue. At its core, MVDiffusion processes perspective images in parallel with a pre-trained text-to-image diffusion model, while integrating novel correspondence-aware attention layers to facilitate cross-view interactions. For panorama generation, despite being trained on only 10k panoramas, MVDiffusion can generate high-resolution photorealistic images for arbitrary text prompts or extrapolate a single perspective image to a 360-degree view. For multi-view depth-to-image generation, MVDiffusion demonstrates state-of-the-art performance for texturing a scene mesh.
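The core idea above can be illustrated with a toy sketch: each view's feature tokens attend both to themselves and to the tokens at their corresponding pixels in a neighboring view, so information flows across views during denoising. This is a minimal, hedged illustration in numpy, not the authors' implementation; the single-neighbor layout, the two-element key set, and the `corr` index array standing in for pixel-to-pixel correspondences are all simplifying assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def correspondence_aware_attention(feats, corr):
    """Toy correspondence-aware attention (illustrative only).

    feats: (n_views, H*W, d) per-view feature tokens.
    corr:  (n_views, H*W) integer index of the corresponding token in the
           next view -- a stand-in for pixel-to-pixel correspondences
           (e.g., from panorama crops or from depth maps and poses).
    Each token attends to itself and to its correspondent in an
    adjacent view, mixing information across views.
    """
    n, hw, d = feats.shape
    out = np.empty_like(feats)
    for i in range(n):
        j = (i + 1) % n                       # toy choice of neighboring view
        q = feats[i]                          # (hw, d) queries
        k_self = v_self = feats[i]            # self keys/values
        k_nb = v_nb = feats[j][corr[i]]       # gathered correspondents, (hw, d)
        # attention over the 2-element key set {self, correspondent}
        logits = np.stack(
            [(q * k_self).sum(-1), (q * k_nb).sum(-1)], axis=-1
        ) / np.sqrt(d)
        w = softmax(logits, axis=-1)          # (hw, 2) attention weights
        out[i] = w[:, :1] * v_self + w[:, 1:] * v_nb
    return out
```

When all views carry identical features and `corr` is the identity map, the two attention weights are equal and the output reproduces the input, which is a quick sanity check that the gather-and-attend wiring is consistent.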
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Texture Synthesis | 3D-Front (test) | CLIP Score | 18.47 | 7 |
| Text-to-Panorama Generation | PEBench (test) | FID | 96.07 | 7 |
| Panorama Generation | Matterport3D (test) | FID | 21.44 | 5 |
| Multi-view depth-to-image generation | ScanNet (test) | FID | 23.1 | 3 |