One Diffusion to Generate Them All

About

We introduce OneDiffusion, a versatile, large-scale diffusion model that seamlessly supports bidirectional image synthesis and understanding across diverse tasks. It enables conditional generation from inputs such as text, depth, pose, layout, and semantic maps, while also handling tasks like image deblurring, upscaling, and reverse processes such as depth estimation and segmentation. Additionally, OneDiffusion allows for multi-view generation, camera pose estimation, and instant personalization using sequential image inputs. Our model takes a straightforward yet effective approach by treating all tasks as frame sequences with varying noise scales during training, allowing any frame to act as a conditioning image at inference time. Our unified training framework removes the need for specialized architectures, supports scalable multi-task training, and adapts smoothly to any resolution, enhancing both generalization and scalability. Experimental results demonstrate competitive performance across both generation and prediction tasks, such as text-to-image, multi-view generation, ID preservation, depth estimation, and camera pose estimation, despite a relatively small training dataset. Our code and checkpoint are freely available at https://github.com/lehduong/OneDiffusion

Duong H. Le, Tuan Pham, Sangho Lee, Christopher Clark, Aniruddha Kembhavi, Stephan Mandt, Ranjay Krishna, Jiasen Lu • 2024
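
The abstract's central idea, treating every task as a sequence of frames that each carry their own noise scale, can be illustrated with a toy sketch. This is not the authors' implementation: the linear mix of signal and Gaussian noise, the flattened list-of-floats "frames", and the helper names `noise_frames`, `training_levels`, and `inference_levels` are all assumptions made for illustration. See the repository linked above for the actual code.

```python
import random

def noise_frames(frames, levels, rng):
    """Mix each frame with Gaussian noise at its own level in [0, 1].

    Level 0.0 keeps the frame clean; level 1.0 replaces it with pure noise.
    The linear mix is an illustrative assumption, not the paper's schedule.
    """
    if len(frames) != len(levels):
        raise ValueError("need exactly one noise level per frame")
    noisy = []
    for frame, level in zip(frames, levels):
        noisy.append([(1.0 - level) * x + level * rng.gauss(0.0, 1.0) for x in frame])
    return noisy

def training_levels(num_frames, rng):
    # Training: every frame draws an independent noise level.
    return [rng.random() for _ in range(num_frames)]

def inference_levels(num_frames, condition_idx):
    # Inference: conditioning frames stay clean, all others start as pure noise.
    return [0.0 if i in condition_idx else 1.0 for i in range(num_frames)]

rng = random.Random(0)

# Two toy frames (flattened pixel values), e.g. an image and its depth map.
frames = [[0.1, 0.2, 0.3], [0.9, 0.8, 0.7]]

# Training-style input: each frame is noised independently.
train_levels = training_levels(len(frames), rng)
train_input = noise_frames(frames, train_levels, rng)

# Depth estimation posed as conditional generation:
# frame 0 (the image) is the condition, frame 1 (the depth map) is generated.
levels = inference_levels(len(frames), condition_idx={0})
infer_input = noise_frames(frames, levels, rng)
```

Swapping the roles, with `condition_idx={1}`, would turn the same setup into depth-to-image generation, which is the sense in which any frame can act as the conditioning input.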

Related benchmarks

Task	Dataset	Result
Text-to-Image Generation	GenEval	GenEval Score: 65	459
Monocular Depth Estimation	NYU v2 (test)	AbsRel: 6.8	327
Image Editing	PIE-Bench	PSNR: 27.49	257
Depth Estimation	KITTI	--	184
Depth Estimation	ScanNet	AbsRel: 0.094	133
Depth Estimation	DIODE	Delta-1 Accuracy: 66.1	92
Subject-driven generation	DreamBench	DINO Score: 0.692	30
Depth Estimation	ETH3D	AbsRel: 0.072	25
Depth Estimation	NYU	AbsRel: 0.087	20
Monocular Depth Estimation	DIODE (test)	AbsRel: 29.4	17

Showing 10 of 16 rows
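
For readers unfamiliar with the depth metrics in the table, the sketch below computes AbsRel and Delta-1 accuracy using their standard definitions. The function names and sample values are hypothetical, and the listing does not say whether each row reports a fraction or a percentage, so the sketch returns plain fractions.

```python
def abs_rel(pred, gt):
    """Mean absolute relative error: mean(|pred - gt| / gt). Lower is better."""
    return sum(abs(p - g) / g for p, g in zip(pred, gt)) / len(gt)

def delta1(pred, gt, threshold=1.25):
    """Fraction of pixels with max(pred/gt, gt/pred) < threshold. Higher is better."""
    hits = sum(1 for p, g in zip(pred, gt) if max(p / g, g / p) < threshold)
    return hits / len(gt)

# Hypothetical ground-truth and predicted depths in metres (all strictly positive).
gt = [1.0, 2.0, 4.0, 5.0]
pred = [1.1, 2.0, 3.0, 5.5]

abs_rel_score = abs_rel(pred, gt)   # about 0.1125
delta1_score = delta1(pred, gt)     # 3 of 4 pixels fall within the 1.25 ratio
```

Both metrics assume valid, positive ground-truth depths; real evaluation code typically masks out invalid pixels first.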
