The Chosen One: Consistent Characters in Text-to-Image Diffusion Models
About
Recent advances in text-to-image generation models have unlocked vast potential for visual creativity. However, users of these models struggle to generate consistent characters, a crucial aspect for numerous real-world applications such as story visualization, game development, asset design, advertising, and more. Current methods typically rely on multiple pre-existing images of the target character or involve labor-intensive manual processes. In this work, we propose a fully automated solution for consistent character generation, with the sole input being a text prompt. We introduce an iterative procedure that, at each stage, identifies a coherent set of images sharing a similar identity and extracts a more consistent identity from this set. Our quantitative analysis demonstrates that our method strikes a better balance between prompt alignment and identity consistency than the baseline methods, and these findings are reinforced by a user study. Finally, we showcase several practical applications of our approach.
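The released pipeline is not reproduced here; the following is a minimal sketch of the iterative loop described above, under stated assumptions: generated images are embedded with CLIP, grouped with k-means, and the most cohesive cluster seeds the next round. The model names, helper functions (`embed`, `most_cohesive_cluster`), and cluster count are illustrative choices, and the identity-update step (which the paper realizes by fine-tuning the generator on the chosen set) is left as a placeholder.

```python
# Sketch of the iterative consistent-identity procedure (assumptions noted above).
import numpy as np
import torch
from diffusers import StableDiffusionPipeline
from sklearn.cluster import KMeans
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5"  # assumed base model, not from the paper
).to(device)
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def embed(images):
    """L2-normalized CLIP image embeddings, one row per image."""
    inputs = proc(images=images, return_tensors="pt").to(device)
    with torch.no_grad():
        feats = clip.get_image_features(**inputs)
    return (feats / feats.norm(dim=-1, keepdim=True)).cpu().numpy()


def most_cohesive_cluster(embs, n_clusters=3):
    """Indices of the cluster whose members lie closest to their centroid."""
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(embs)
    best_idx, best_spread = None, np.inf
    for c in range(n_clusters):
        idx = np.where(km.labels_ == c)[0]
        if len(idx) < 2:
            continue
        spread = np.linalg.norm(embs[idx] - km.cluster_centers_[c], axis=1).mean()
        if spread < best_spread:
            best_idx, best_spread = idx, spread
    return best_idx


prompt = "a watercolor painting of a cheerful robot chef"
for step in range(3):  # iterate until the identity stabilizes
    images = pipe(prompt, num_images_per_prompt=8).images
    chosen = most_cohesive_cluster(embed(images))
    # Placeholder: the actual method would now refine the generator on
    # images[i] for i in chosen (e.g. via personalization fine-tuning),
    # so each round produces an increasingly consistent character.
```

The key design point is the selection step: by always extracting the identity from the tightest cluster, the procedure converges toward a single character while the prompt keeps the outputs aligned with the requested scene.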
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Consistent Text-to-Image Generation | ConsiStory+ (test) | CLIP-T | 0.7614 | 23 |
| Multi-frame visual story generation | ConsiStory+ | CLIP-T | 76.14 | 12 |
| Consistent Text-to-Image Generation | ConsiStory+ evaluation prompts | Human Preference Rate | 0.1 | 8 |
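CLIP-T above measures prompt alignment; the two CLIP-T rows appear to report the same score on a 0–1 versus 0–100 scale. A minimal sketch of how such a score is conventionally computed, assuming the standard definition (cosine similarity between CLIP text and image embeddings) and the `openai/clip-vit-base-patch32` checkpoint, neither of which is specified by the table:

```python
# Sketch of a conventional CLIP-T (prompt-image alignment) score.
# Checkpoint choice and scaling are assumptions, not benchmark specifics.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def clip_t(prompt: str, image: Image.Image) -> float:
    """Cosine similarity between the prompt and the image in CLIP space."""
    inputs = processor(text=[prompt], images=[image],
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    t = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    i = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    return float((t * i).sum())


# Hypothetical usage: "sample.png" stands in for a generated image.
score = clip_t("a cheerful robot chef", Image.open("sample.png"))
print(f"CLIP-T: {score:.4f}")  # 0-1 scale; multiply by 100 for the 0-100 form
```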