
FrameDiT: Diffusion Transformer with Frame-Level Matrix Attention for Efficient Video Generation

About

High-fidelity video generation remains challenging for diffusion models due to the difficulty of modeling complex spatio-temporal dynamics efficiently. Recent video diffusion methods typically represent a video as a sequence of spatio-temporal tokens, which can be modeled using Diffusion Transformers (DiTs). However, this approach faces a trade-off between the strong but expensive Full 3D Attention and the efficient but temporally limited Local Factorized Attention. To resolve this trade-off, we propose Matrix Attention, a frame-level temporal attention mechanism that processes an entire frame as a matrix and generates query, key, and value matrices via matrix-native operations. By attending across frames rather than tokens, Matrix Attention effectively preserves global spatio-temporal structure and adapts to significant motion. We build FrameDiT-G, a DiT architecture based on Matrix Attention, and further introduce FrameDiT-H, which integrates Matrix Attention with Local Factorized Attention to capture both large and small motion. Extensive experiments show that FrameDiT-H achieves state-of-the-art results across multiple video generation benchmarks, offering improved temporal coherence and video quality while maintaining efficiency comparable to Local Factorized Attention.
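The core idea of attending across frames rather than tokens can be sketched in a few lines. The sketch below is a minimal illustration under assumptions, not the paper's actual implementation: each frame is kept as an H×W matrix, query/key/value frames are produced by a left matrix multiplication (a stand-in for the paper's "matrix-native operations"), and frame-to-frame similarity uses the Frobenius inner product, giving a T×T attention map instead of a (T·H·W)×(T·H·W) one.

```python
import numpy as np

def frame_matrix_attention(frames, Wq, Wk, Wv):
    """Hypothetical sketch of frame-level matrix attention.

    frames: (T, H, W) video, one matrix per frame.
    Wq, Wk, Wv: (H, H) projections applied by left multiplication
    (an assumption; the paper's exact matrix-native ops may differ).
    """
    T, H, W = frames.shape
    Q = np.einsum('ij,tjk->tik', Wq, frames)  # (T, H, W)
    K = np.einsum('ij,tjk->tik', Wk, frames)
    V = np.einsum('ij,tjk->tik', Wv, frames)
    # Frobenius inner product between whole frames -> (T, T) scores,
    # so attention cost scales with T^2, not (T*H*W)^2.
    scores = np.einsum('thw,shw->ts', Q, K) / np.sqrt(H * W)
    # numerically stable softmax over source frames
    scores -= scores.max(axis=-1, keepdims=True)
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    # each output frame is an attention-weighted mix of value frames
    return np.einsum('ts,shw->thw', attn, V)

rng = np.random.default_rng(0)
frames = rng.standard_normal((8, 16, 16))       # T=8 frames of 16x16
Wq, Wk, Wv = (rng.standard_normal((16, 16)) * 0.1 for _ in range(3))
out = frame_matrix_attention(frames, Wq, Wk, Wv)
print(out.shape)  # (8, 16, 16)
```

The T×T score matrix is what keeps the cost close to Local Factorized Attention while still letting every frame attend to every other frame globally.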

Minh Khoa Le, Kien Do, Duc Thanh Nguyen, Truyen Tran • 2026

Related benchmarks

Task                       Dataset         Metric       Result   Rank
Video Generation           UCF101          FVD          170.1    68
Video Generation           SkyTimelapse    FVD          39.5     22
Video Generation           FaceForensics   FVD          16.6     15
Video Generation           Taichi-HD       FVD          95.5     12
Text-to-Video Generation   VBench          Total Score  79.12    7
