StoryDiffusion: Consistent Self-Attention for Long-Range Image and Video Generation
About
For recent diffusion-based generative models, maintaining consistent content across a series of generated images, especially those containing subjects and complex details, presents a significant challenge. In this paper, we propose a new self-attention mechanism, termed Consistent Self-Attention, that significantly boosts the consistency between generated images and augments prevalent pretrained diffusion-based text-to-image models in a zero-shot manner. To extend our method to long-range video generation, we further introduce a novel semantic-space temporal motion prediction module, named Semantic Motion Predictor. It is trained to estimate the motion conditions between two provided images in the semantic space. This module converts the generated sequence of images into videos with smooth transitions and consistent subjects, and is significantly more stable than modules that operate only in latent space, especially for long video generation. By merging these two novel components, our framework, referred to as StoryDiffusion, can describe a text-based story with consistent images or videos encompassing a rich variety of content. StoryDiffusion represents a pioneering exploration of visual story generation through images and videos, which we hope will inspire further research on architectural modifications. Our code is made publicly available at https://github.com/HVision-NKU/StoryDiffusion.
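The abstract does not spell out the mechanism, but the key idea behind Consistent Self-Attention is to let each image's self-attention layer also attend to tokens sampled from the other images generated in the same story batch, so shared subject features propagate across all frames without any training. The following is a minimal single-head NumPy sketch under that assumption; the function names and the `sample_ratio` parameter are illustrative, not from the paper's code:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def consistent_self_attention(x, sample_ratio=0.5, seed=0):
    """Sketch of Consistent Self-Attention (assumed mechanism).

    For each image in the batch, a random subset of tokens from the
    *other* images is appended to the key/value set, so attention can
    reference subject features shared across the whole story batch.

    x: (B, N, D) token features for B images of one story.
    Returns an array of the same shape.
    """
    rng = np.random.default_rng(seed)
    B, N, D = x.shape
    out = np.empty_like(x)
    for b in range(B):
        # Pool tokens from every other image in the batch.
        others = np.concatenate([x[i] for i in range(B) if i != b], axis=0)
        k = max(1, int(sample_ratio * others.shape[0]))
        idx = rng.choice(others.shape[0], size=k, replace=False)
        # Keys/values = own tokens + sampled cross-image tokens.
        kv = np.concatenate([x[b], others[idx]], axis=0)   # (N + k, D)
        attn = softmax(x[b] @ kv.T / np.sqrt(D), axis=-1)  # (N, N + k)
        out[b] = attn @ kv
    return out
```

In a real diffusion U-Net this substitution would happen inside each self-attention block (with separate Q/K/V projections and multiple heads); the sketch only illustrates the cross-image key/value sharing that makes the generated images mutually consistent.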
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Cinematic Story Generation | ViStoryBench | CSD (Cross) | 0.34 | 24 |
| Consistent Text-to-Image Generation | ConsiStory+ (test) | CLIP-T | 0.8877 | 23 |
| Visual Storytelling Consistency | ViStoryBench | CSD (Self) | 63.5 | 13 |
| Story Visualization | StorySalon long stories (test) | CLIP-T | 0.315 | 13 |
| Multi-frame Visual Story Generation | ConsiStory+ | CLIP-T | 88.77 | 12 |
| Story Visualization | StorySalon regular-length (test) | CLIP-T | 0.311 | 10 |
| Multi-shot Video Generation | 90-prompt evaluation suite | Type Accuracy | 52.22 | 9 |
| Multi-shot Cinematic Video Generation | Multi-shot Cinematic Video Generation (test) | AQ (Aesthetic Quality) | 69.41 | 9 |
| Regular-Length Story Visualization | StoryGen Regular-Length Story Visualization (Human Evaluation) | Alignment | 3.96 | 8 |
| Long Story Visualization | StoryGen Human Evaluation Set Long Story Visualization | Alignment | 4.16 | 7 |