MAGI-1: Autoregressive Video Generation at Scale
About
We present MAGI-1, a world model that generates videos by autoregressively predicting a sequence of video chunks, defined as fixed-length segments of consecutive frames. Trained to denoise per-chunk noise that increases monotonically over time, MAGI-1 enables causal temporal modeling and naturally supports streaming generation. It achieves strong performance on image-to-video (I2V) tasks conditioned on text instructions, providing high temporal consistency and scalability, which are made possible by several algorithmic innovations and a dedicated infrastructure stack. MAGI-1 facilitates controllable generation via chunk-wise prompting and supports real-time, memory-efficient deployment by maintaining constant peak inference cost, regardless of video length. The largest variant of MAGI-1 comprises 24 billion parameters and supports context lengths of up to 4 million tokens, demonstrating the scalability and robustness of our approach. The code and models are available at https://github.com/SandAI-org/MAGI-1 and https://github.com/SandAI-org/MagiAttention. The product can be accessed at https://sand.ai.
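The abstract's two scheduling claims (per-chunk noise that increases monotonically over time, and constant peak inference cost regardless of video length) can be illustrated with a toy schedule. This is a minimal sketch under stated assumptions, not the MAGI-1 implementation: the function name `streaming_schedule`, the `window` parameter, and the linear noise levels are hypothetical, and no actual denoising model is involved.

```python
from collections import deque
from typing import Iterator, List, Tuple


def streaming_schedule(num_chunks: int, window: int) -> Iterator[List[Tuple[int, float]]]:
    """Yield, for each step, the (chunk_index, noise_level) pairs being denoised.

    Hypothetical toy schedule: noise_level 1.0 means pure noise, and every
    chunk needs `window` denoising stages before it is clean and leaves.
    """
    active = deque()  # entries are [chunk_index, remaining_stages], oldest first
    next_chunk = 0
    while next_chunk < num_chunks or active:
        # Admit one fresh, fully noisy chunk if there is room in the window.
        if next_chunk < num_chunks and len(active) < window:
            active.append([next_chunk, window])
            next_chunk += 1
        # Earlier chunks have fewer stages remaining, so they are cleaner.
        yield [(idx, remaining / window) for idx, remaining in active]
        # Advance every active chunk by one denoising stage.
        for entry in active:
            entry[1] -= 1
        # The oldest chunk is finished once it has no stages left.
        if active and active[0][1] == 0:
            active.popleft()


if __name__ == "__main__":
    for step, chunks in enumerate(streaming_schedule(num_chunks=3, window=2)):
        print(step, chunks)
```

In this sketch, noise levels within any step rise from the oldest chunk to the newest, and the number of chunks processed at once never exceeds `window`, no matter how large `num_chunks` is.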
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Long Video Generation | VBench-Long 60 seconds | Subject Consistency 79.46 | 74 |
| Video Generation | VBench 5s | Quality Score 82.04 | 73 |
| Video Generation | VBench (test) | Semantic Score 72.02 | 66 |
| Video Generation | VBench Long | Motion Smoothness 99.1 | 49 |
| Video Generation | short videos 81-frames 240 prompts | Total Score 5.25 | 38 |
| Text-to-Video Generation | VBench (test) | Total Score 79.18 | 37 |
| Video Generation | VBench | Motion Smoothness 98.43 | 37 |
| Image-to-Video Generation | VBench I2V | -- | 24 |
| Text-to-Video Generation | StoryEval-Bench 1.0 (test) | Human Score 39.6 | 22 |
| Video Generation | VBench 1.0 (test) | Image Quality 0.6066 | 21 |