MAGI-1: Autoregressive Video Generation at Scale
About
We present MAGI-1, a world model that generates videos by autoregressively predicting a sequence of video chunks, defined as fixed-length segments of consecutive frames. Trained to denoise per-chunk noise that increases monotonically over time, MAGI-1 enables causal temporal modeling and naturally supports streaming generation. It achieves strong performance on image-to-video (I2V) tasks conditioned on text instructions, providing high temporal consistency and scalability, which are made possible by several algorithmic innovations and a dedicated infrastructure stack. MAGI-1 facilitates controllable generation via chunk-wise prompting and supports real-time, memory-efficient deployment by maintaining constant peak inference cost, regardless of video length. The largest variant of MAGI-1 comprises 24 billion parameters and supports context lengths of up to 4 million tokens, demonstrating the scalability and robustness of our approach. The code and models are available at https://github.com/SandAI-org/MAGI-1 and https://github.com/SandAI-org/MagiAttention. The product can be accessed at https://sand.ai.
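The chunk-wise schedule described above (per-chunk noise that increases monotonically over time, so older chunks are cleaner and can be emitted while newer ones are still denoising) can be sketched with a toy pipeline. This is a minimal illustration, not the MAGI-1 implementation: the denoiser is a stand-in, and the names `generate_chunks`, `window`, and `steps_per_chunk` are hypothetical. It shows the two properties the abstract claims: streaming (chunks are emitted as soon as they finish denoising) and constant peak cost (at most `window` chunks are ever in flight, regardless of video length).

```python
from collections import deque
import numpy as np

def generate_chunks(num_chunks, window=4, steps_per_chunk=4,
                    chunk_frames=6, frame_dim=8, seed=0):
    """Toy sketch of pipelined chunk-wise denoising (not the real model).

    Each active chunk carries a noise level that rises monotonically from
    the oldest chunk to the newest; every iteration denoises all active
    chunks by one step and streams out any chunk that is fully clean.
    Peak memory is bounded by `window`, independent of total video length.
    """
    rng = np.random.default_rng(seed)
    active = deque()             # entries: [chunk, denoising steps remaining]
    emitted, started, peak = [], 0, 0
    while len(emitted) < num_chunks:
        # Admit a fresh all-noise chunk while the sliding window has room.
        if started < num_chunks and len(active) < window:
            active.append([rng.standard_normal((chunk_frames, frame_dim)),
                           steps_per_chunk])
            started += 1
        peak = max(peak, len(active))
        # Invariant: noise increases monotonically, oldest chunk cleanest.
        assert all(a[1] < b[1] for a, b in zip(active, list(active)[1:]))
        for entry in active:
            entry[0] = entry[0] * 0.5  # stand-in for one learned denoise step
            entry[1] -= 1
        # Stream out chunks that have finished denoising.
        while active and active[0][1] == 0:
            emitted.append(active.popleft()[0])
    return emitted, peak
```

In steady state the window holds `window` chunks at staggered noise levels and emits one finished chunk per iteration, which is why peak inference cost stays flat as the video grows.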
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Video Generation | VBench (test) | Semantic Score | 67.74 | 35 |
| Video Generation | VBench 5s | Total Score | 79.18 | 35 |
| Video Generation | VBench 1.0 (test) | Image Quality | 0.6066 | 21 |
| Video Generation | VBench short video (test) | Subject Consistency | 67.74 | 16 |
| Short Video Generation | VBench 2024 | Total Score | 79.18 | 11 |
| Short Video Generation | VBench official prompts | Total Score | 79.18 | 11 |
| Video Generation | VBench Overall | Throughput (FPS) | 0.19 | 11 |
| Video Generation | Single-prompt 5-second setting | Total Score | 79.18 | 11 |
| Video Generation | VBench 5-10s | Temporal Consistency | 89.1 | 9 |
| Video Generation | 75s videos | Text Alignment | 24.95 | 9 |