Mask$^2$DiT: Dual Mask-based Diffusion Transformer for Multi-Scene Long Video Generation

About

Sora has unveiled the immense potential of the Diffusion Transformer (DiT) architecture in single-scene video generation. However, the more challenging task of multi-scene video generation, which offers broader applications, remains relatively underexplored. To bridge this gap, we propose Mask$^2$DiT, a novel approach that establishes fine-grained, one-to-one alignment between video segments and their corresponding text annotations. Specifically, we introduce a symmetric binary mask at each attention layer within the DiT architecture, ensuring that each text annotation applies exclusively to its respective video segment while preserving temporal coherence across visual tokens. This attention mechanism enables precise segment-level textual-to-visual alignment, allowing the DiT architecture to effectively handle video generation tasks with a fixed number of scenes. To further equip the DiT architecture with the ability to generate additional scenes based on existing ones, we incorporate a segment-level conditional mask, which conditions each newly generated segment on the preceding video segments, thereby enabling auto-regressive scene extension. Both qualitative and quantitative experiments confirm that Mask$^2$DiT excels in maintaining visual consistency across segments while ensuring semantic alignment between each segment and its corresponding text description. Our project page is https://tianhao-qi.github.io/Mask2DiTProject.
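The two masks described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation. It assumes a token layout in which all text tokens come first and all video tokens follow, and the function names `build_segment_attention_mask` and `segment_condition_flags` are hypothetical. The first function builds a symmetric binary mask in which each text annotation attends only to itself and its own video segment, while all visual tokens attend to each other. The second is one plausible reading of the segment-level conditional mask: it flags which video tokens belong to already-available segments (context) and which belong to segments still to be generated.

```python
import numpy as np

def build_segment_attention_mask(text_lens, video_lens):
    """Return a boolean (N, N) mask where True means attention is allowed.

    Assumed layout (not taken from the paper): [text_0..text_n, video_0..video_n].
    """
    assert len(text_lens) == len(video_lens), "one annotation per video segment"
    # Segment index of every token, for text tokens and video tokens.
    text_seg = np.repeat(np.arange(len(text_lens)), text_lens)
    video_seg = np.repeat(np.arange(len(video_lens)), video_lens)
    seg = np.concatenate([text_seg, video_seg])
    is_text = np.concatenate([
        np.ones(text_seg.size, dtype=bool),
        np.zeros(video_seg.size, dtype=bool),
    ])
    # Tokens of the same segment may attend to each other (text_i <-> video_i).
    same_segment = seg[:, None] == seg[None, :]
    # All visual tokens attend to each other, across segments.
    both_visual = ~is_text[:, None] & ~is_text[None, :]
    return same_segment | both_visual

def segment_condition_flags(video_lens, num_context_segments):
    """Flag video tokens of the first `num_context_segments` segments as context.

    Hypothetical helper: True marks tokens treated as given (preceding scenes),
    False marks tokens of the segments to be generated.
    """
    video_seg = np.repeat(np.arange(len(video_lens)), video_lens)
    return video_seg < num_context_segments

# Two scenes, 2 text tokens and 3 video tokens each.
mask = build_segment_attention_mask([2, 2], [3, 3])
flags = segment_condition_flags([3, 3], num_context_segments=1)
```

In an attention layer, a mask like this would typically be applied by setting the disallowed logits to negative infinity before the softmax. In the auto-regressive setting, the context flags would select which segments are held fixed while a new one is generated.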

Tianhao Qi, Jianlong Yuan, Wanquan Feng, Shancheng Fang, Jiawei Liu, SiYu Zhou, Qian He, Hongtao Xie, Yongdong Zhang • 2025

Related benchmarks

Task	Dataset	Metric	Result
Multi-shot cinematic audio-video generation	CineBench Human Evaluation 1.0	Video Quality	3.91	13
Multi-Shot Audio-Video Generation	CineBench	Audio Quality (AQ)	0.56	13
Multi-shot Video Generation	90 prompts evaluation suite	Type Accuracy	20.33	9
Video Generation	User Study (test)	Video Quality Score	49.23	8
Long Video Generation	User Study Evaluation Set (test)	Visual Consistency	3.08	8
Multi-shot Video Generation	Gemini 100 multi-shot video prompts 2.5 Pro	Intra-shot Consistency (Subject)	0.646	8
Text-to-multi-shot video generation	T2MSV Text-to-multi-shot	Character Coherence (Inter-shot)	0.5472	6
Auto-regressive scene extension	EvalCrafter	VQA_A	75.48	5
Auto-regressive scene extension	T2V-CompBench	Action Binding Score	2.3	5
Multi-scene video generation	Multi-scene evaluation dataset 1.0 (test)	Visual Consistency	70.95	5

