
ViewDiff: 3D-Consistent Image Generation with Text-to-Image Models

About

3D asset generation has attracted massive attention, inspired by the recent success of text-guided 2D content creation. Existing text-to-3D methods use pretrained text-to-image diffusion models in an optimization problem or fine-tune them on synthetic data, which often results in non-photorealistic 3D objects without backgrounds. In this paper, we present a method that leverages pretrained text-to-image models as a prior and learns to generate multi-view images in a single denoising process from real-world data. Concretely, we propose to integrate 3D volume-rendering and cross-frame-attention layers into each block of the existing U-Net of the text-to-image model. Moreover, we design an autoregressive generation scheme that renders more 3D-consistent images at any viewpoint. We train our model on real-world datasets of objects and showcase its ability to generate instances with a variety of high-quality shapes and textures in authentic surroundings. Compared to existing methods, the results generated by our method are 3D-consistent and have favorable visual quality (-30% FID, -37% KID).
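The cross-frame-attention layers mentioned above let each view's U-Net features attend to the features of all other views in the same denoising batch, which is what ties the generated images together. A minimal sketch of that idea, assuming single-head attention and illustrative names (this is not the authors' code):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_frame_attention(feats, Wq, Wk, Wv):
    """Each frame's tokens attend to the tokens of *all* frames.

    feats: (n_frames, n_tokens, dim) activations from one U-Net block.
    Wq, Wk, Wv: (dim, dim) projection matrices (hypothetical weights).
    Returns attended features of shape (n_frames, n_tokens, dim).
    """
    n_frames, n_tokens, dim = feats.shape
    q = feats @ Wq                      # queries per frame: (F, T, d)
    kv = feats.reshape(-1, dim)         # flatten frames together: (F*T, d)
    k = kv @ Wk                         # keys from all frames:   (F*T, d)
    v = kv @ Wv                         # values from all frames: (F*T, d)
    # Attention over the concatenated tokens of every frame.
    attn = softmax(q @ k.T / np.sqrt(dim), axis=-1)  # (F, T, F*T)
    return attn @ v                     # (F, T, d)
```

The key difference from ordinary self-attention is that keys and values are pooled across all frames, so information (e.g. object texture) flows between viewpoints during denoising.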

Lukas Höllein, Aljaž Božič, Norman Müller, David Novotny, Hung-Yu Tseng, Christian Richardt, Michael Zollhöfer, Matthias Nießner • 2024

Related benchmarks

Task                         Dataset                      Result                   Rank
Image-to-3D Generation       Synthetic 3D Objects (test)  Ewarp: 0.0036            6
Single-image reconstruction  CO3D v2 (test)               PSNR (Teddybear): 21.98  3
Unconditional Generation     CO3D Teddybear v2 (test)     FID: 49.39               3
Unconditional Generation     CO3D Hydrant v2 (test)       FID: 46.45               3
Unconditional Generation     CO3D Donut v2 (test)         FID: 68.86               3
Unconditional Generation     CO3D Apple v2 (test)         FID: 56.85               3
