Open-MAGVIT2: An Open-Source Project Toward Democratizing Auto-regressive Visual Generation
About
The Open-MAGVIT2 project produces an open-source replication of Google's MAGVIT-v2 tokenizer, a tokenizer with a super-large codebook (i.e., $2^{18}$ codes), and achieves state-of-the-art reconstruction performance on the ImageNet and UCF benchmarks. We also provide a tokenizer pre-trained on large-scale data, significantly outperforming Cosmos on zero-shot benchmarks (1.93 vs. 0.78 rFID on ImageNet at its original resolution). Furthermore, we explore its application in plain auto-regressive models to validate scalability properties, producing a family of auto-regressive image generation models ranging from 300M to 1.5B parameters. To help auto-regressive models predict over a super-large vocabulary, we factorize it into two sub-vocabularies of different sizes via asymmetric token factorization, and further introduce "next sub-token prediction" to strengthen sub-token interaction for better generation quality. We release all models and code to foster innovation and creativity in the field of auto-regressive visual generation.
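The asymmetric factorization above can be sketched in a few lines: a single token id drawn from the full $2^{18}$ codebook is split into a pair of sub-tokens from two smaller vocabularies whose sizes multiply back to the full size. The sub-vocabulary sizes below ($2^{6}$ and $2^{12}$) are illustrative assumptions, not necessarily the ones used in the released models.

```python
# Minimal sketch of asymmetric token factorization.
# Assumed sub-vocabulary sizes (2**6 and 2**12) are illustrative;
# the only requirement is V1 * V2 == V.

V = 2 ** 18                # full codebook size
V1, V2 = 2 ** 6, 2 ** 12   # asymmetric sub-vocabulary sizes, V1 * V2 == V

def factorize(token_id: int) -> tuple[int, int]:
    """Split a full-vocabulary id into (coarse, fine) sub-tokens."""
    assert 0 <= token_id < V
    return token_id // V2, token_id % V2

def defactorize(coarse: int, fine: int) -> int:
    """Recombine the two sub-tokens into the original token id."""
    assert 0 <= coarse < V1 and 0 <= fine < V2
    return coarse * V2 + fine

# Round-trip check: factorizing and recombining is lossless.
tid = 123_456
coarse, fine = factorize(tid)
assert defactorize(coarse, fine) == tid
```

Each auto-regressive prediction step then outputs two small softmaxes (over $V_1$ and $V_2$) instead of one softmax over $2^{18}$ entries, which is what makes the super-large vocabulary tractable; "next sub-token prediction" additionally conditions the fine sub-token on the coarse one.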
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Class-conditional Image Generation | ImageNet 256x256 | Inception Score (IS) | 271.8 | 441 |
| Image Generation | ImageNet 256x256 (val) | FID | 2.33 | 307 |
| Class-conditional Image Generation | ImageNet 256x256 (val) | FID | 2.33 | 293 |
| Class-conditional Image Generation | ImageNet 256x256 (test) | FID | 3.08 | 167 |
| Image Reconstruction | ImageNet 256x256 | rFID | 0.34 | 93 |
| Image Generation | ImageNet-1K 256x256 (val) | Inception Score | 271.8 | 85 |
| Image Reconstruction | ImageNet-1K (val) | FID | 1.17 | 83 |
| Class-conditional Image Generation | ImageNet class-conditional 256x256 (test val) | FID | 2.33 | 75 |
| Class-conditional Image Generation | ImageNet-1K (val) | FID | 2.33 | 68 |
| Image Reconstruction | ImageNet (val) | rFID | 1.17 | 54 |