HoloCine: Holistic Generation of Cinematic Multi-Shot Long Video Narratives
About
State-of-the-art text-to-video models excel at generating isolated clips but fall short of creating coherent, multi-shot narratives, which are the essence of storytelling. We bridge this "narrative gap" with HoloCine, a model that generates entire scenes holistically to ensure global consistency from the first shot to the last. Our architecture achieves precise directorial control through a Window Cross-Attention mechanism that localizes text prompts to specific shots, while a Sparse Inter-Shot Self-Attention pattern (dense within shots but sparse between them) ensures the efficiency required for minute-scale generation. Beyond setting a new state of the art in narrative coherence, HoloCine develops remarkable emergent abilities: a persistent memory for characters and scenes, and an intuitive grasp of cinematic techniques. Our work marks a pivotal shift from clip synthesis towards automated filmmaking, making end-to-end cinematic creation a tangible future. Our code is available at: https://holo-cine.github.io/.
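
The two attention patterns named in the abstract can be illustrated with boolean masks. The following is a minimal sketch, not the authors' implementation. It rests on several assumptions that the abstract does not state: that "sparse between shots" means each token sees only the first `summary_tokens` tokens of every other shot, that the text sequence is laid out as optional shared tokens followed by one prompt segment per shot, and that all function and parameter names are hypothetical.

```python
def shot_ids(segment_lengths):
    """Map each token position to the index of the segment (shot) it belongs to."""
    return [s for s, n in enumerate(segment_lengths) for _ in range(n)]


def sparse_self_attention_mask(shot_lengths, summary_tokens=1):
    """mask[q][k] is True when video token q may attend to video token k.

    Dense inside a shot. Across shots, only the first `summary_tokens`
    tokens of the other shot are visible (an assumed sparsity pattern).
    """
    ids = shot_ids(shot_lengths)
    starts = [sum(shot_lengths[:s]) for s in range(len(shot_lengths))]
    n = len(ids)
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            same_shot = ids[q] == ids[k]
            is_summary = k - starts[ids[k]] < summary_tokens
            mask[q][k] = same_shot or is_summary
    return mask


def window_cross_attention_mask(shot_lengths, prompt_lengths, global_len=0):
    """mask[q][t] is True when video token q may attend to text token t.

    Assumed text layout: `global_len` shared tokens, then one segment of
    prompt tokens per shot, in shot order.
    """
    video = shot_ids(shot_lengths)
    text = [-1] * global_len + shot_ids(prompt_lengths)  # -1 marks shared text
    return [[t == -1 or t == v for t in text] for v in video]


if __name__ == "__main__":
    self_mask = sparse_self_attention_mask([3, 2], summary_tokens=1)
    allowed = sum(sum(row) for row in self_mask)
    print(f"self-attention pairs kept: {allowed} of {len(self_mask) ** 2}")
```

With two shots of 3 and 2 tokens and one summary token per shot, the self-attention mask keeps 18 of the 25 query-key pairs. The saving grows with the number of shots, because the cross-shot cost scales with the number of summary tokens rather than with full shot length.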
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Video Generation Quality Evaluation | EvalVerse | Machine Win Ratio | 81 | 172 |
| Text-to-Video | ShotVerse-Bench | Motion Type Appropriateness | 4.324 | 12 |
| Multi-shot Video Generation | ShotVerse-Bench | Semantic Consistency (Global) | 0.297 | 7 |
| Multi-shot video storytelling | ST-Bench | Aesthetic Quality | 56.53 | 5 |
| Intra-shot prompt-following alignment | Pillar 2 Intra-shot prompt-following alignment | Intra-shot Character Presence | 88.2 | 4 |
| Cross-shot consistency | Pillar 3 Cross-shot consistency | CS Consistency (Face) | 75.1 | 4 |
| Intra-shot quality evaluation | EntityBench | Subject Consistency | 86 | 4 |
| Video Generation | EntityBench Cross-shot 1.0 | Cross-shot Face Consistency | 75.1 | 4 |
| Video Generation | EntityBench Intra-shot 1.0 | Imaging Quality | 49.97 | 4 |
| Multi-shot Video Generation | Multi-shot Video Benchmark 15s | Aesthetic Score | 0.5842 | 3 |