
FrameDiT: Diffusion Transformer with Frame-Level Matrix Attention for Efficient Video Generation

About

High-fidelity video generation remains challenging for diffusion models due to the difficulty of modeling complex spatio-temporal dynamics efficiently. Recent video diffusion methods typically represent a video as a sequence of spatio-temporal tokens, which can be modeled using Diffusion Transformers (DiTs). However, this approach faces a trade-off between the strong but expensive Full 3D Attention and the efficient but temporally limited Local Factorized Attention. To resolve this trade-off, we propose Matrix Attention, a frame-level temporal attention mechanism that processes an entire frame as a matrix and generates query, key, and value matrices via matrix-native operations. By attending across frames rather than tokens, Matrix Attention effectively preserves global spatio-temporal structure and adapts to significant motion. We build FrameDiT-G, a DiT architecture based on Matrix Attention, and further introduce FrameDiT-H, which integrates Matrix Attention with Local Factorized Attention to capture both large and small motion. Extensive experiments show that FrameDiT-H achieves state-of-the-art results across multiple video generation benchmarks, offering improved temporal coherence and video quality while maintaining efficiency comparable to Local Factorized Attention.
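The core idea of attending across frames rather than tokens can be sketched in a few lines. The sketch below is a minimal illustration under assumptions, not the paper's actual implementation: each frame is kept as an H×W matrix, query/key/value frames are produced by a left matrix multiplication (a stand-in for the paper's "matrix-native operations"), and frame-to-frame similarity uses the Frobenius inner product, giving a T×T attention map instead of a (T·H·W)×(T·H·W) one.

```python
import numpy as np

def frame_matrix_attention(frames, Wq, Wk, Wv):
    """Hypothetical sketch of frame-level matrix attention.

    frames: (T, H, W) video, one matrix per frame.
    Wq, Wk, Wv: (H, H) projections applied by left multiplication
    (an assumption; the paper's exact matrix-native ops may differ).
    """
    T, H, W = frames.shape
    Q = np.einsum('ij,tjk->tik', Wq, frames)  # (T, H, W)
    K = np.einsum('ij,tjk->tik', Wk, frames)
    V = np.einsum('ij,tjk->tik', Wv, frames)
    # Frobenius inner product between whole frames -> (T, T) scores,
    # so attention cost scales with T^2, not (T*H*W)^2.
    scores = np.einsum('thw,shw->ts', Q, K) / np.sqrt(H * W)
    # numerically stable softmax over source frames
    scores -= scores.max(axis=-1, keepdims=True)
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    # each output frame is an attention-weighted mix of value frames
    return np.einsum('ts,shw->thw', attn, V)

rng = np.random.default_rng(0)
frames = rng.standard_normal((8, 16, 16))       # T=8 frames of 16x16
Wq, Wk, Wv = (rng.standard_normal((16, 16)) * 0.1 for _ in range(3))
out = frame_matrix_attention(frames, Wq, Wk, Wv)
print(out.shape)  # (8, 16, 16)
```

The T×T score matrix is what keeps the cost close to Local Factorized Attention while still letting every frame attend to every other frame globally.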

Minh Khoa Le, Kien Do, Duc Thanh Nguyen, Truyen Tran • 2026

Related benchmarks

Task                       Dataset         Metric       Result   Rank
Video Generation           UCF101          FVD          170.1    68
Video Generation           SkyTimelapse    FVD          39.5     22
Video Generation           FaceForensics   FVD          16.6     15
Video Generation           Taichi-HD       FVD          95.5     12
Text-to-Video Generation   VBench          Total Score  79.12    7
