StoryMem: Multi-shot Long Video Storytelling with Memory
About
Visual storytelling requires generating multi-shot videos with cinematic quality and long-range consistency. Inspired by human memory, we propose StoryMem, a paradigm that reformulates long-form video storytelling as iterative shot synthesis conditioned on explicit visual memory, transforming pre-trained single-shot video diffusion models into multi-shot storytellers. This is achieved by a novel Memory-to-Video (M2V) design, which maintains a compact and dynamically updated memory bank of keyframes from historical generated shots. The stored memory is then injected into single-shot video diffusion models via latent concatenation and negative RoPE shifts with only LoRA fine-tuning. A semantic keyframe selection strategy, together with aesthetic preference filtering, further ensures informative and stable memory throughout generation. Moreover, the proposed framework naturally accommodates smooth shot transitions and customized story generation applications. To facilitate evaluation, we introduce ST-Bench, a diverse benchmark for multi-shot video storytelling. Extensive experiments demonstrate that StoryMem achieves superior cross-shot consistency over previous methods while preserving high aesthetic quality and prompt adherence, marking a significant step toward coherent minute-long video storytelling.
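The loop described above (generate a shot, pick keyframes, update the memory, condition the next shot on it) can be illustrated with a toy sketch. This is not the authors' implementation: `Keyframe`, `MemoryBank`, `temporal_positions`, `generate_story`, the thresholds, and the oldest-first eviction rule are all hypothetical names and choices made for illustration, and `generate_shot` stands in for a LoRA-tuned video diffusion model. The sketch assumes that "negative RoPE shifts" means memory frames receive negative temporal position indices while the current shot starts at 0, and that semantic selection can be approximated by dropping candidates whose embeddings are too similar to frames already stored.

```python
import math
from dataclasses import dataclass, field


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass
class Keyframe:
    shot_index: int
    embedding: list   # stand-in for a semantic feature vector
    aesthetic: float  # stand-in for an aesthetic score in [0, 1]


@dataclass
class MemoryBank:
    """Compact store of keyframes from earlier shots (hypothetical design)."""
    capacity: int = 4
    frames: list = field(default_factory=list)

    def update(self, candidates, sim_threshold=0.9, min_aesthetic=0.5):
        for cand in candidates:
            if cand.aesthetic < min_aesthetic:
                continue  # aesthetic preference filtering (assumed threshold)
            if any(cosine(cand.embedding, kept.embedding) > sim_threshold
                   for kept in self.frames):
                continue  # semantically redundant with stored memory
            self.frames.append(cand)
        # Keep the bank compact by dropping the oldest entries (assumption).
        self.frames = self.frames[-self.capacity:]


def temporal_positions(num_memory, num_shot_frames):
    """Assumed reading of negative RoPE shifts: memory frames sit at
    negative temporal indices, the current shot at 0..T-1."""
    return list(range(-num_memory, 0)) + list(range(num_shot_frames))


def generate_story(prompts, generate_shot, bank=None):
    """Iterative shot synthesis conditioned on the memory bank.

    generate_shot(prompt, memory_frames, shot_index) is a placeholder for
    the video model; here it must return candidate Keyframe objects.
    """
    bank = bank if bank is not None else MemoryBank()
    shots = []
    for i, prompt in enumerate(prompts):
        candidates = generate_shot(prompt, list(bank.frames), i)
        shots.append(candidates)
        bank.update(candidates)  # memory is refreshed after every shot
    return shots, bank
```

In this sketch the memory seen by shot `i` contains only keyframes that survived filtering in shots `0..i-1`, so the conditioning input stays bounded by `capacity` however long the story grows.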
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Story-driven Video Generation | ViStoryBench | Scene Score | 3.0333 | 6 |
| Short Drama Generation | Short Drama Bench | Opening Hook Score | 2.28 | 6 |
| Multi-shot Video Generation | Video Generation Multi-shot (test) | BG Consistency (VBench) | 91.5 | 6 |
| Video Generation | VBench | Subject Consistency | 80.02 | 6 |
| Multi-shot video storytelling | ST-Bench | Aesthetic Quality | 61.33 | 5 |
| Cross-shot consistency | Pillar 3 Cross-shot consistency | CS Consistency (Face) | 79.2 | 4 |
| Video Generation | EntityBench Cross-shot 1.0 | Cross-shot Face Consistency | 79.2 | 4 |
| Intra-shot prompt-following alignment | Pillar 2 Intra-shot prompt-following alignment | Intra-shot Character Presence | 84.9 | 4 |
| Video Generation | EntityBench Intra-shot 1.0 | Imaging Quality | 56.41 | 4 |
| Intra-shot quality evaluation | EntityBench | Subject Consistency | 75.9 | 4 |