BLIP-Diffusion: Pre-trained Subject Representation for Controllable Text-to-Image Generation and Editing
About
Subject-driven text-to-image generation models create novel renditions of an input subject based on text prompts. Existing models suffer from lengthy fine-tuning and difficulty preserving subject fidelity. To overcome these limitations, we introduce BLIP-Diffusion, a new subject-driven image generation model that supports multimodal control, consuming subject images and text prompts as input. Unlike other subject-driven generation models, BLIP-Diffusion introduces a new multimodal encoder pre-trained to provide subject representation. We first pre-train the multimodal encoder following BLIP-2 to produce visual representation aligned with the text. We then design a subject representation learning task that enables a diffusion model to leverage such visual representation and generate new subject renditions. Compared with previous methods such as DreamBooth, our model enables zero-shot subject-driven generation and efficient fine-tuning for customized subjects with up to 20x speedup. We also demonstrate that BLIP-Diffusion can be flexibly combined with existing techniques such as ControlNet and prompt-to-prompt to enable novel subject-driven generation and editing applications. Code and models will be released at https://github.com/salesforce/LAVIS/tree/main/projects/blip-diffusion. Project page at https://dxli94.github.io/BLIP-Diffusion-website/.
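The abstract describes a two-stage design: a BLIP-2-style multimodal encoder turns a subject image plus its category word into a subject embedding, which is then combined with the text prompt to condition the diffusion model. The sketch below illustrates that conditioning flow in toy NumPy code; all function names, dimensions, and the concatenation scheme here are illustrative assumptions, not the released implementation.

```python
# Hypothetical sketch of BLIP-Diffusion-style conditioning (toy dimensions;
# the real encoders and their output shapes differ).
import numpy as np

EMBED_DIM = 8   # toy embedding width (real text encoders use e.g. 768)
N_QUERY = 4     # toy number of learned query tokens in the encoder

def multimodal_encoder(subject_image, subject_text):
    """Stand-in for the BLIP-2-style multimodal encoder: maps a subject
    image and its category word to a fixed-length subject embedding."""
    rng = np.random.default_rng(1)  # placeholder for a learned network
    return rng.standard_normal((N_QUERY, EMBED_DIM))

def text_encoder(prompt_tokens):
    """Stand-in for the diffusion model's text encoder."""
    rng = np.random.default_rng(0)  # placeholder for a learned network
    return rng.standard_normal((len(prompt_tokens), EMBED_DIM))

def build_conditioning(prompt_tokens, subject_image, subject_text):
    """Append the subject embedding to the prompt embedding; the combined
    sequence would condition the diffusion U-Net via cross-attention."""
    subj = multimodal_encoder(subject_image, subject_text)
    txt = text_encoder(prompt_tokens)
    return np.concatenate([txt, subj], axis=0)

cond = build_conditioning(["a", "dog", "on", "the", "beach"],
                          subject_image=None, subject_text="dog")
print(cond.shape)  # (5 + N_QUERY, EMBED_DIM)
```

Because the subject embedding is produced by a pre-trained encoder rather than learned per subject, the same pipeline supports zero-shot generation, and fine-tuning only needs to refine an already-useful representation.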
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Question Answering | TextVQA | Accuracy | 42.5 | 1117 |
| Visual Question Answering | VizWiz | Accuracy | 19.6 | 1043 |
| Visual Question Answering | GQA | Accuracy | 41 | 963 |
| Multimodal Understanding | SEED-Bench | Accuracy | 46.4 | 203 |
| Video Question Answering | MSVD | Accuracy | 20.3 | 100 |
| Subject-driven image generation | DreamBench | DINO Score | 0.67 | 62 |
| Large Multimodal Model Evaluation | MM-Vet | Average Score | 22.4 | 58 |
| Video Question Answering | MSRVTT | Accuracy | 10.3 | 46 |
| Subject-driven generation | DreamBench (test) | DINO Score | 0.67 | 25 |
| Consistent Text-to-Image Generation | ConsiStory+ (test) | CLIP-T | 0.8187 | 23 |