Story-Iter: A Training-free Iterative Paradigm for Long Story Visualization

About

This paper introduces Story-Iter, a new training-free iterative paradigm to enhance long-story generation. Unlike existing methods that rely on fixed reference images to construct a complete story, our approach features a novel external iterative paradigm that extends beyond the internal iterative denoising steps of diffusion models and continuously refines each generated image by incorporating all reference images from the previous round. To achieve this, we propose a plug-and-play, training-free global reference cross-attention (GRCA) module that models all reference frames with global embeddings, ensuring semantic consistency in long sequences. By progressively incorporating holistic visual context and text constraints, our iterative paradigm enables precise generation with fine-grained interactions, optimizing the story visualization step by step. Extensive experiments on the official story visualization dataset and our long story benchmark demonstrate Story-Iter's state-of-the-art performance in long-story visualization (up to 100 frames), where it excels in both semantic consistency and fine-grained interactions.
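The external iteration described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: `story_iter`, `toy_generate`, and the `Image` alias are hypothetical names, the image generator is replaced by a toy function on numbers, and starting from a text-only round is an assumption. The only point it shows is the control flow, in which every frame of a round is regenerated from its prompt plus all frames of the previous round.

```python
from typing import Callable, List, Sequence

# Toy stand-in for an image (or its embedding); hypothetical, for illustration only.
Image = List[float]


def story_iter(
    prompts: Sequence[str],
    generate: Callable[[str, Sequence[Image]], Image],
    num_rounds: int,
) -> List[Image]:
    """Outer (external) iteration over whole-story rounds.

    `generate(prompt, references)` is a hypothetical stand-in for a
    reference-conditioned text-to-image call.
    """
    # Assumed initialization: one frame per prompt, generated without references.
    frames = [generate(p, []) for p in prompts]
    for _ in range(num_rounds):
        previous = frames  # all frames from the previous round act as references
        frames = [generate(p, previous) for p in prompts]
    return frames


def toy_generate(prompt: str, references: Sequence[Image]) -> Image:
    """Toy generator: blends a prompt feature with the mean of the references."""
    base = float(len(prompt))
    if not references:
        return [base]
    mean_ref = sum(r[0] for r in references) / len(references)
    return [0.5 * base + 0.5 * mean_ref]


if __name__ == "__main__":
    print(story_iter(["a", "bbb"], toy_generate, num_rounds=1))
```

In a real pipeline the `generate` callable would wrap a diffusion model whose cross-attention sees embeddings of all reference frames; that part is omitted here because the abstract does not specify it.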

Jiawei Mao, Xiaoke Huang, Yunfei Xie, Yuanqi Chang, Mude Hui, Bingjie Xu, Zeyu Zheng, Zirui Wang, Cihang Xie, Yuyin Zhou • 2024

Related benchmarks

Task	Dataset	Metric	Result
Cinematic Story Generation	ViStoryBench	CSD (Cross)	0.325	24
Visual Storytelling	ViStoryBench Lite 2025	CSD (Cross)	0.518	21
Story Visualization	StorySalon long stories (test)	CLIP-T	0.318	13
Story Visualization	StorySalon regular-length (test)	CLIP-T	0.31	10
Video Generation	FilMaster evaluation suite	Script Faithfulness (SF)	3.75	9
Regular-Length Story Visualization	StoryGen Regular-Length Story Visualization (Human Evaluation)	Alignment	4.06	8
Long Story Visualization	StoryGen Human Evaluation Set Long Story Visualization	Alignment	4.35	7
Reference-to-Shot Storyboard Synthesis	DreamShot (test)	CIDS (Character) Self	42.2	7
Subject-consistent image generation	StoryGen Human Evaluation Set Subject-Consistent Image Generation	Alignment	4.2	6
Subject-consistent image generation	Subject-consistent image generation benchmark (test)	CLIP-T Score	0.332	6

Showing 10 of 22 rows
