StoryMem: Multi-shot Long Video Storytelling with Memory

About

Visual storytelling requires generating multi-shot videos with cinematic quality and long-range consistency. Inspired by human memory, we propose StoryMem, a paradigm that reformulates long-form video storytelling as iterative shot synthesis conditioned on explicit visual memory, transforming pre-trained single-shot video diffusion models into multi-shot storytellers. This is achieved by a novel Memory-to-Video (M2V) design, which maintains a compact and dynamically updated memory bank of keyframes from previously generated shots. The stored memory is then injected into single-shot video diffusion models via latent concatenation and negative RoPE shifts, requiring only LoRA fine-tuning. A semantic keyframe selection strategy, together with aesthetic preference filtering, further ensures informative and stable memory throughout generation. Moreover, the proposed framework naturally accommodates smooth shot transitions and customized story generation applications. To facilitate evaluation, we introduce ST-Bench, a diverse benchmark for multi-shot video storytelling. Extensive experiments demonstrate that StoryMem achieves superior cross-shot consistency over previous methods while preserving high aesthetic quality and prompt adherence, marking a significant step toward coherent minute-long video storytelling.
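The iterative, memory-conditioned loop described in the abstract can be illustrated with a small Python sketch. This is a toy under stated assumptions, not the authors' implementation: frames are stood in for by plain embedding vectors, and `MemoryBank`, `temporal_positions`, `generate_story`, the `generate_shot` and `aesthetic_score` callables, the thresholds, and the drop-oldest eviction policy are all hypothetical names and choices introduced here for illustration. The `temporal_positions` helper shows one plausible reading of "negative RoPE shifts", where memory frames receive negative temporal indices so they sit before the current shot, which starts at index 0.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, List, Sequence

Frame = Sequence[float]  # hypothetical stand-in for a keyframe embedding


def cosine(a: Frame, b: Frame) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


@dataclass
class MemoryBank:
    """Compact keyframe memory (capacity and eviction policy are assumptions)."""
    capacity: int = 4
    keyframes: List[Frame] = field(default_factory=list)

    def update(self, frames: List[Frame],
               aesthetic_score: Callable[[Frame], float],
               min_score: float = 0.5, max_sim: float = 0.9) -> None:
        for f in frames:
            # Aesthetic filtering: skip frames scored below the threshold.
            if aesthetic_score(f) < min_score:
                continue
            # Semantic selection: skip near-duplicates of stored keyframes.
            if any(cosine(f, k) > max_sim for k in self.keyframes):
                continue
            self.keyframes.append(f)
        # Assumed eviction policy: keep only the most recent keyframes.
        self.keyframes = self.keyframes[-self.capacity:]


def temporal_positions(num_memory: int, num_shot_frames: int) -> List[int]:
    """Negative indices for memory frames, then 0..T-1 for the current shot."""
    return list(range(-num_memory, 0)) + list(range(num_shot_frames))


def generate_story(prompts: List[str],
                   generate_shot: Callable[[str, List[Frame]], List[Frame]],
                   aesthetic_score: Callable[[Frame], float],
                   capacity: int = 4):
    """Generate shots one at a time, each conditioned on the current memory."""
    bank = MemoryBank(capacity=capacity)
    shots = []
    for prompt in prompts:
        frames = generate_shot(prompt, list(bank.keyframes))
        shots.append(frames)
        bank.update(frames, aesthetic_score)
    return shots, bank
```

In this sketch `generate_shot` plays the role of the single-shot video model; any callable that takes a prompt plus the current keyframes and returns a list of frames can be plugged in to try the loop.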

Kaiwen Zhang, Liming Jiang, Angtian Wang, Jacob Zhiyuan Fang, Tiancheng Zhi, Qing Yan, Hao Kang, Xin Lu, Xingang Pan • 2025

Related benchmarks

Task	Dataset	Result
Multi-Shot Audio-Video Generation	CineBench	Audio Quality (AQ): 0.57	13
Multi-shot cinematic audio-video generation	CineBench Human Evaluation 1.0	Video Quality: 3.72	13
Long Video Extrapolation	FlintstonesSV, Pororo-SV, ActivityNet Captions, YouCook2, Shot2Story, and MovieNet Average	FVD: 286.1	10
Multi-shot Video Generation	GroundBench	ARC Score: 0.314	10
Story-driven Video Generation	ViStoryBench	Scene Score: 3.0333	6
Short Drama Generation	Short Drama Bench	Opening Hook Score: 2.28	6
Multi-shot Video Generation	Video Generation Multi-shot (test)	BG Consistency (VBench): 91.5	6
Video Generation	VBench	Subject Consistency: 80.02	6
Multi-shot video storytelling	ST-Bench	Aesthetic Quality: 61.33	5
Multi-shot Video Generation	GroundBench (40 randomly sampled scripts)	Identity Score: 3.48	5

Showing 10 of 16 rows
