MVDiffusion: Enabling Holistic Multi-view Image Generation with Correspondence-Aware Diffusion
About
This paper introduces MVDiffusion, a simple yet effective method for generating consistent multi-view images from text prompts, given pixel-to-pixel correspondences (e.g., perspective crops from a panorama, or multi-view images given depth maps and poses). Unlike prior methods that rely on iterative image warping and inpainting, MVDiffusion generates all images simultaneously with global awareness, effectively addressing the prevalent error-accumulation issue. At its core, MVDiffusion processes perspective images in parallel with a pre-trained text-to-image diffusion model, while integrating novel correspondence-aware attention layers to facilitate cross-view interactions. For panorama generation, despite being trained on only 10k panoramas, MVDiffusion can generate high-resolution photorealistic images for arbitrary text prompts or extrapolate a single perspective image to a 360-degree view. For multi-view depth-to-image generation, MVDiffusion demonstrates state-of-the-art performance for texturing a scene mesh.
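The core idea above can be illustrated with a toy sketch: each view's feature tokens attend both to themselves and to the tokens at their corresponding pixels in a neighboring view, so information flows across views during denoising. This is a minimal, hedged illustration in numpy, not the authors' implementation; the single-neighbor layout, the two-element key set, and the `corr` index array standing in for pixel-to-pixel correspondences are all simplifying assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def correspondence_aware_attention(feats, corr):
    """Toy correspondence-aware attention (illustrative only).

    feats: (n_views, H*W, d) per-view feature tokens.
    corr:  (n_views, H*W) integer index of the corresponding token in the
           next view -- a stand-in for pixel-to-pixel correspondences
           (e.g., from panorama crops or from depth maps and poses).
    Each token attends to itself and to its correspondent in an
    adjacent view, mixing information across views.
    """
    n, hw, d = feats.shape
    out = np.empty_like(feats)
    for i in range(n):
        j = (i + 1) % n                       # toy choice of neighboring view
        q = feats[i]                          # (hw, d) queries
        k_self = v_self = feats[i]            # self keys/values
        k_nb = v_nb = feats[j][corr[i]]       # gathered correspondents, (hw, d)
        # attention over the 2-element key set {self, correspondent}
        logits = np.stack(
            [(q * k_self).sum(-1), (q * k_nb).sum(-1)], axis=-1
        ) / np.sqrt(d)
        w = softmax(logits, axis=-1)          # (hw, 2) attention weights
        out[i] = w[:, :1] * v_self + w[:, 1:] * v_nb
    return out
```

When all views carry identical features and `corr` is the identity map, the two attention weights are equal and the output reproduces the input, which is a quick sanity check that the gather-and-attend wiring is consistent.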
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Texture Synthesis | 3D-Front (test) | CLIP Score | 18.47 | 7 |
| Text-to-Panorama Generation | PEBench (test) | FID | 96.07 | 7 |
| Panorama Generation | Matterport3D (test) | FID | 21.44 | 5 |
| Multi-view depth-to-image generation | ScanNet (test) | FID | 23.1 | 3 |