Story-Iter: A Training-free Iterative Paradigm for Long Story Visualization
About
This paper introduces Story-Iter, a new training-free iterative paradigm for long-story generation. Unlike existing methods that rely on fixed reference images to construct a complete story, our approach adds an external iterative loop, beyond the internal iterative denoising steps of diffusion models, that continuously refines each generated image by incorporating all reference images from the previous round. To achieve this, we propose a plug-and-play, training-free global reference cross-attention (GRCA) module that models all reference frames with global embeddings, ensuring semantic consistency across long sequences. By progressively incorporating holistic visual context and text constraints, the iterative paradigm enables precise generation with fine-grained interactions, optimizing the story visualization step by step. Extensive experiments on the official story visualization dataset and our long-story benchmark demonstrate that Story-Iter achieves state-of-the-art performance in long-story visualization (up to 100 frames), excelling in both semantic consistency and fine-grained interactions.
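The external loop described above can be illustrated with a minimal sketch. This is not the authors' implementation: `story_iter`, `toy_generate`, and the `Frame` type are hypothetical names, the text-only first round is an assumption, and the toy generator simply blends a text-derived vector with the mean of the reference frames as a stand-in for a diffusion model conditioned through GRCA.

```python
from typing import Callable, List, Sequence

# Hypothetical stand-in for an image (or its embedding).
Frame = List[float]


def story_iter(
    prompts: Sequence[str],
    generate: Callable[[str, Sequence[Frame]], Frame],
    num_rounds: int,
) -> List[Frame]:
    """Outer refinement loop: each round regenerates every frame
    conditioned on ALL frames produced in the previous round."""
    # Assumption: the initial round uses text only (no references).
    frames = [generate(p, []) for p in prompts]
    for _ in range(num_rounds):
        # Snapshot the previous round so every frame in this round
        # sees the same reference set, not partially updated frames.
        refs = list(frames)
        frames = [generate(p, refs) for p in prompts]
    return frames


def toy_generate(prompt: str, refs: Sequence[Frame]) -> Frame:
    """Toy generator (assumption): a 2-d 'text' vector blended equally
    with the mean of the reference frames. A real system would run a
    diffusion model with reference-conditioned cross-attention here."""
    text_vec = [float(len(prompt)), float(prompt.count(" "))]
    if not refs:
        return text_vec
    mean_ref = [sum(r[i] for r in refs) / len(refs) for i in range(2)]
    return [0.5 * t + 0.5 * m for t, m in zip(text_vec, mean_ref)]


prompts = ["a cat", "a cat sleeps on a mat"]
frames = story_iter(prompts, toy_generate, num_rounds=1)
```

In this toy setting each refinement round pulls the frames toward a shared global context while the per-frame text term keeps them distinct, which mirrors the stated goal of balancing semantic consistency against text constraints.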
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Cinematic Story Generation | ViStoryBench | CSD (Cross) 0.325 | 24 |
| Story Visualization | StorySalon long stories (test) | CLIP-T 0.318 | 13 |
| Story Visualization | StorySalon regular-length (test) | CLIP-T 0.31 | 10 |
| Regular-Length Story Visualization | StoryGen Regular-Length Story Visualization (Human Evaluation) | Alignment 4.06 | 8 |
| Long Story Visualization | StoryGen Human Evaluation Set Long Story Visualization | Alignment 4.35 | 7 |
| Subject-consistent image generation | StoryGen Human Evaluation Set Subject-Consistent Image Generation | Alignment 4.2 | 6 |
| Subject-consistent image generation | Subject-consistent image generation benchmark (test) | CLIP-T Score 0.332 | 6 |
| Story Generation | Story Generation Evaluation Set | Text Alignment 81.08 | 5 |
| Story Generation | ConsiStory-Human 1.0 (test) | CLIP-T Score 34.3 | 5 |
| Story-consistent Image Generation | User Study 20 story-based scenarios | Text Alignment (%) 5 | 5 |