Taming Teacher Forcing for Masked Autoregressive Video Generation

About

We introduce MAGI, a hybrid video generation framework that combines masked modeling for intra-frame generation with causal modeling for next-frame generation. Our key innovation, Complete Teacher Forcing (CTF), conditions masked frames on complete observation frames rather than on masked ones (the latter scheme is Masked Teacher Forcing, MTF), enabling a smooth transition from token-level (patch-level) to frame-level autoregressive generation. CTF significantly outperforms MTF, improving FVD by 23% on first-frame-conditioned video prediction. To address issues such as exposure bias, we employ targeted training strategies, setting a new benchmark in autoregressive video generation. Experiments show that MAGI can generate long, coherent video sequences exceeding 100 frames even when trained on as few as 16 frames, highlighting its potential for scalable, high-quality video generation.
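The difference between the two conditioning schemes can be illustrated with a toy sketch. This is not the authors' implementation: the `MASK` placeholder, the `mask_frame` and `build_example` helpers, the uniform random masking, and the list-of-token-ids video representation are all assumptions made for illustration. It only shows what a model would see as context when predicting frame `t` under each scheme.

```python
import random

MASK = -1  # hypothetical placeholder id for a masked token (assumption)


def mask_frame(frame, ratio, rng):
    """Return a copy of `frame` with round(len * ratio) tokens replaced by MASK."""
    n_mask = round(len(frame) * ratio)
    hidden = set(rng.sample(range(len(frame)), n_mask))
    return [MASK if i in hidden else tok for i, tok in enumerate(frame)]


def build_example(video, t, ratio, mode, seed=0):
    """Build (context, target_input, target_labels) for predicting frame t.

    mode="CTF": context frames are the complete ground-truth frames.
    mode="MTF": context frames are themselves masked.
    In both modes the target frame is masked and its labels are the
    complete ground-truth tokens.
    """
    rng = random.Random(seed)
    masked = [mask_frame(f, ratio, rng) for f in video]
    if mode == "CTF":
        context = [list(f) for f in video[:t]]
    elif mode == "MTF":
        context = masked[:t]
    else:
        raise ValueError("mode must be 'CTF' or 'MTF'")
    return context, masked[t], list(video[t])


# Toy video: 3 frames of 4 token ids each.
video = [[10, 11, 12, 13], [20, 21, 22, 23], [30, 31, 32, 33]]
ctf_context, ctf_input, labels = build_example(video, t=2, ratio=0.5, mode="CTF")
mtf_context, mtf_input, _ = build_example(video, t=2, ratio=0.5, mode="MTF")
```

Under this sketch, the CTF context matches what is available at inference time, when previously generated frames are complete. The MTF context contains masked tokens that the model would not see in frame-level autoregressive generation.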

Deyu Zhou, Quan Sun, Yuang Peng, Kun Yan, Runpei Dong, Duomin Wang, Zheng Ge, Nan Duan, Xiangyu Zhang, Lionel M. Ni, Heung-Yeung Shum • 2025

Related benchmarks

Task	Dataset	Result
Video Prediction	Kinetics-600 (test)	--	46
Video Generation	Kinetics-600	FVD 11.5	22
Unconditional video generation	UCF-101 256x256	FVD (256x256, 2048) 297.8	12
Unconditional video generation	UCF-101	FVD (2048 Dim) 421	7