In-Context Learning Unlocked for Diffusion Models

About

We present Prompt Diffusion, a framework for enabling in-context learning in diffusion-based generative models. Given a pair of task-specific example images (for example, depth map to image or image to depth map, scribble to image or image to scribble) and text guidance, our model infers the underlying task and performs the same task on a new query image, following the text guidance. To achieve this, we propose a vision-language prompt that can model a wide range of vision-language tasks and a diffusion model that takes this prompt as input. The diffusion model is trained jointly on six different tasks using these prompts. The resulting Prompt Diffusion model is the first diffusion-based vision-language foundation model capable of in-context learning. It demonstrates high-quality in-context generation on the trained tasks and generalizes effectively to new, unseen vision tasks with their respective prompts. Our model also shows compelling text-guided image editing results. Our framework aims to facilitate research into in-context learning for computer vision. We share our code and pre-trained models at https://github.com/Zhendong-Wang/Prompt-Diffusion.
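The vision-language prompt described above combines an example image pair, a query image, and text guidance. The following is a minimal Python sketch of how such a prompt could be represented as a data structure. All names here (`VisionLanguagePrompt`, `build_prompt`, the field names) are hypothetical and are not taken from the released code. The same-size check is an assumption of this sketch, not a documented requirement of the model.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical stand-in for an image: a nested list of pixel values.
Image = List[List[float]]


@dataclass(frozen=True)
class VisionLanguagePrompt:
    """Hypothetical container for the four prompt components."""
    example_source: Image  # first image of the example pair, e.g. a depth map
    example_target: Image  # second image of the example pair, e.g. the matching photo
    query: Image           # new image on which the same task should be performed
    text: str              # text guidance for the output


def image_shape(img: Image) -> tuple:
    """Return (height, width) of a nested-list image."""
    return (len(img), len(img[0]) if img else 0)


def build_prompt(example_source: Image, example_target: Image,
                 query: Image, text: str) -> VisionLanguagePrompt:
    """Bundle the components, checking that all images share one size.

    The size check is an assumption made for this sketch only.
    """
    shapes = {image_shape(i) for i in (example_source, example_target, query)}
    if len(shapes) != 1:
        raise ValueError(f"images must share one shape, got {sorted(shapes)}")
    return VisionLanguagePrompt(example_source, example_target, query, text)


if __name__ == "__main__":
    depth = [[0.1, 0.9], [0.5, 0.2]]   # toy 2x2 "depth map"
    photo = [[0.3, 0.7], [0.6, 0.4]]   # toy 2x2 "image"
    query = [[0.8, 0.2], [0.1, 0.9]]   # toy 2x2 query "depth map"
    prompt = build_prompt(depth, photo, query, "a cozy living room")
    print(prompt.text)
```

In this sketch the task is never named explicitly: it is implied only by the relation between `example_source` and `example_target`, which mirrors the in-context setup described in the abstract.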

Zhendong Wang, Yifan Jiang, Yadong Lu, Yelong Shen, Pengcheng He, Weizhu Chen, Zhangyang Wang, Mingyuan Zhou • 2023

Related benchmarks

Task	Dataset	Result
Controllable Image Generation	COCO (test)	Inference Latency (s): 9.63	14
Image Manipulation	Image manipulation Few-shot (In Distribution)	CLIP-Dir: 17.13	7
Image Manipulation	Few-shot image manipulation (Out of Distribution)	CLIP Directional Score: 15.41	6
Conditional Image Generation (HED Edge)	COCO 5,000 samples 2017 (val)	FID: 59.4	6
Depth Estimation	Visual In-Context Learning (V-ICL) Benchmark	AbsRel: 0.16	5
Edge Detection	Visual In-Context Learning (V-ICL) Benchmark	RMSE: 35.88	5
Colorization	Visual In-Context Learning (V-ICL) Benchmark	FID: 179.2	5
Object Detection	PASCAL-5i	mIoU: 32.6	5
Surface Normal Estimation	Visual In-Context Learning (V-ICL) Benchmark	Median Angular Error: 97.27	5
Image Deraining	Visual In-Context Learning (V-ICL) Benchmark	PSNR: 8.67	5

Showing 10 of 28 rows
