Training-Free Consistent Text-to-Image Generation

About

Text-to-image models offer a new level of creative flexibility by allowing users to guide the image generation process through natural language. However, using these models to consistently portray the same subject across diverse prompts remains challenging. Existing approaches fine-tune the model to teach it new words that describe specific user-provided subjects or add image conditioning to the model. These methods require lengthy per-subject optimization or large-scale pre-training. Moreover, they struggle to align generated images with text prompts and face difficulties in portraying multiple subjects. Here, we present ConsiStory, a training-free approach that enables consistent subject generation by sharing the internal activations of the pretrained model. We introduce a subject-driven shared attention block and correspondence-based feature injection to promote subject consistency between images. Additionally, we develop strategies to encourage layout diversity while maintaining subject consistency. We compare ConsiStory to a range of baselines, and demonstrate state-of-the-art performance on subject consistency and text alignment, without requiring a single optimization step. Finally, ConsiStory can naturally extend to multi-subject scenarios, and even enable training-free personalization for common objects.
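The shared attention idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it assumes that per-image subject masks are already available, that attention is plain scaled dot-product attention, and that each image's queries see all of its own tokens plus only the subject tokens of the other images. The function name `shared_attention` and its arguments are hypothetical, chosen for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def shared_attention(queries, keys, values, subject_masks):
    """Toy subject-masked shared attention (illustrative assumption only).

    queries, keys, values: one list of token vectors per image.
    subject_masks: one list of booleans per image, True where a token
    is assumed to lie on the subject.
    """
    outputs = []
    for i, image_queries in enumerate(queries):
        # Own tokens are always visible; other images contribute
        # only the tokens flagged as subject.
        pool_k, pool_v = [], []
        for j in range(len(keys)):
            for t in range(len(keys[j])):
                if j == i or subject_masks[j][t]:
                    pool_k.append(keys[j][t])
                    pool_v.append(values[j][t])

        image_out = []
        for q in image_queries:
            scale = math.sqrt(len(q))
            scores = [sum(a * b for a, b in zip(q, k)) / scale for k in pool_k]
            weights = softmax(scores)
            dim = len(pool_v[0])
            image_out.append(
                [sum(w * v[c] for w, v in zip(weights, pool_v)) for c in range(dim)]
            )
        outputs.append(image_out)
    return outputs
```

With every mask set to False the function reduces to ordinary per-image self-attention, so the only thing the masks change is which tokens from other images enter the key and value pool.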

Yoad Tewel, Omri Kaduri, Rinon Gal, Yoni Kasten, Lior Wolf, Gal Chechik, Yuval Atzmon • 2024

Related benchmarks

Task	Dataset	Result
Consistent Text-to-Image Generation	ConsiStory+ (test)	CLIP-T 0.8942	23
Story Visualization	StorySalon long stories (test)	CLIP-T 0.316	13
Multi-frame visual story generation	ConsiStory+	CLIP-T 87.69	12
Story Visualization	LogicTale (test)	Instance Consistency (VL) 3.23	9
Multi-character story generation	Multi-character story generation (test)	CLIP-T 33.55	8
Visual Storytelling	ConsiStory+	CLIP-T Score 0.8564	7
Multiple Subjects Story Customization	M2SB	CLIP-T Score 0.302	6
Story Generation	M2SB	CLIP-T 0.303	6
Visual Storytelling	FreeStoryBench Multi Character type-based referring prompts	CLIP-T 0.8599	6
Visual Storytelling	FreeStoryBench Single Character type-based referring prompts	CLIP-T 84.57	6

Showing 10 of 16 rows
