Story-Iter: A Training-free Iterative Paradigm for Long Story Visualization

About

This paper introduces Story-Iter, a new training-free iterative paradigm to enhance long-story generation. Unlike existing methods that rely on fixed reference images to construct a complete story, our approach features a novel external iterative paradigm that extends beyond the internal iterative denoising steps of diffusion models: it continuously refines each generated image by incorporating all reference images from the previous round. To achieve this, we propose a plug-and-play, training-free global reference cross-attention (GRCA) module that models all reference frames with global embeddings, ensuring semantic consistency in long sequences. By progressively incorporating holistic visual context and text constraints, our iterative paradigm enables precise generation with fine-grained interactions, optimizing the story visualization step by step. Extensive experiments on the official story visualization dataset and our long story benchmark demonstrate that Story-Iter achieves state-of-the-art performance in long-story visualization (up to 100 frames), excelling in both semantic consistency and fine-grained interactions.
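The external iteration described in the abstract can be pictured as an outer loop wrapped around an image generator. The following is a minimal Python sketch of that control flow only, not the authors' implementation. The names `story_iter` and `toy_generate` are hypothetical, images are stood in for by small lists of floats, and the text-only initial round is an assumption, since the abstract does not say how the first round is produced. The toy generator blends a text feature with the mean of all reference frames as a loose stand-in for the GRCA module's global embedding.

```python
from typing import Callable, List, Sequence

Frame = List[float]  # hypothetical stand-in for a generated image


def toy_generate(prompt: str, refs: Sequence[Frame]) -> Frame:
    """Toy generator (assumption): blend a text feature with the mean of all references."""
    text_vec = [float(len(prompt)), float(sum(map(ord, prompt)) % 10)]
    if not refs:
        # No references available yet: condition on text only.
        return text_vec
    # "Global" reference feature: per-dimension mean over every reference frame.
    global_vec = [sum(r[d] for r in refs) / len(refs) for d in range(len(text_vec))]
    return [0.5 * t + 0.5 * g for t, g in zip(text_vec, global_vec)]


def story_iter(
    prompts: Sequence[str],
    generate: Callable[[str, Sequence[Frame]], Frame],
    num_rounds: int,
) -> List[Frame]:
    """Outer refinement loop: each round regenerates every frame using
    all frames from the previous round as references."""
    # Initial round (assumed here to be text-only).
    frames = [generate(p, []) for p in prompts]
    for _ in range(num_rounds):
        refs = list(frames)  # snapshot of the previous round
        frames = [generate(p, refs) for p in prompts]
    return frames


if __name__ == "__main__":
    prompts = ["a cat", "a cat sleeps", "a cat wakes up now"]
    for frame in story_iter(prompts, toy_generate, num_rounds=2):
        print(frame)
```

In this sketch every frame in a round sees the same snapshot of the previous round, so the per-frame generations within a round are independent of each other. Whether the actual method updates references within a round is not stated in the abstract.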

Jiawei Mao, Xiaoke Huang, Yunfei Xie, Yuanqi Chang, Mude Hui, Bingjie Xu, Zeyu Zheng, Zirui Wang, Cihang Xie, Yuyin Zhou • 2024

Related benchmarks

Task | Dataset | Result | Rank
Cinematic Story Generation | ViStoryBench | CSD (Cross) 0.325 | 24
Story Visualization | StorySalon long stories (test) | CLIP-T 0.318 | 13
Story Visualization | StorySalon regular-length (test) | CLIP-T 0.31 | 10
Regular-Length Story Visualization | StoryGen Regular-Length Story Visualization (Human Evaluation) | Alignment 4.06 | 8
Long Story Visualization | StoryGen Human Evaluation Set Long Story Visualization | Alignment 4.35 | 7
Subject-consistent image generation | StoryGen Human Evaluation Set Subject-Consistent Image Generation | Alignment 4.2 | 6
Subject-consistent image generation | Subject-consistent image generation benchmark (test) | CLIP-T Score 0.332 | 6
Story Generation | Story Generation Evaluation Set | Text Alignment 81.08 | 5
Story Generation | ConsiStory-Human 1.0 (test) | CLIP-T Score 34.3 | 5
Story-consistent Image Generation | User Study 20 story-based scenarios | Text Alignment (%) 5 | 5
