
Open-MAGVIT2: An Open-Source Project Toward Democratizing Auto-regressive Visual Generation

About

The Open-MAGVIT2 project provides an open-source replication of Google's MAGVIT-v2 tokenizer, a tokenizer with a super-large codebook ($2^{18}$ codes), and achieves state-of-the-art reconstruction performance on the ImageNet and UCF benchmarks. We also provide a tokenizer pre-trained on large-scale data that significantly outperforms Cosmos on zero-shot benchmarks (1.93 vs. 0.78 rFID on ImageNet at its original resolution). Furthermore, we explore its application in plain auto-regressive models to validate scalability, producing a family of auto-regressive image generation models ranging from 300M to 1.5B parameters. To help auto-regressive models predict over such a super-large vocabulary, we factorize it into two sub-vocabularies of different sizes via asymmetric token factorization, and further introduce "next sub-token prediction" to strengthen sub-token interaction for better generation quality. We release all models and code to foster innovation and creativity in the field of auto-regressive visual generation.
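The two ideas above can be sketched in a few lines. In MAGVIT-v2-style lookup-free quantization, an 18-dimensional latent is binarized channel-wise, so each token is an 18-bit index into a $2^{18}$ codebook; asymmetric factorization then splits that index into two sub-tokens drawn from vocabularies of different sizes. This is a minimal illustrative sketch, not the project's implementation, and the $2^{6}/2^{12}$ split is an assumed example configuration:

```python
NUM_BITS = 18  # codebook size 2**18, as in MAGVIT-v2 / Open-MAGVIT2


def quantize_lfq(latent):
    """Lookup-free quantization: the sign of each of the 18 latent
    channels contributes one bit of the token index."""
    assert len(latent) == NUM_BITS
    index = 0
    for x in latent:
        index = (index << 1) | (1 if x >= 0 else 0)
    return index  # integer in [0, 2**18)


def factorize(index, k=6):
    """Asymmetric factorization (illustrative k=6): split one 18-bit
    token into a 2**k sub-token and a 2**(18-k) sub-token."""
    hi = index >> (NUM_BITS - k)           # sub-vocabulary of size 2**k
    lo = index & ((1 << (NUM_BITS - k)) - 1)  # sub-vocabulary of size 2**(18-k)
    return hi, lo


def combine(hi, lo, k=6):
    """Invert factorize(): reassemble the full 18-bit token index."""
    return (hi << (NUM_BITS - k)) | lo
```

Because the split is lossless, an auto-regressive model can predict the two small sub-tokens in turn ("next sub-token prediction") instead of producing a single softmax over all 262,144 codes.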

Zhuoyan Luo, Fengyuan Shi, Yixiao Ge, Yujiu Yang, Limin Wang, Ying Shan • 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Class-conditional Image Generation | ImageNet 256x256 | Inception Score (IS) | 271.8 | 441 |
| Image Generation | ImageNet 256x256 (val) | FID | 2.33 | 307 |
| Class-conditional Image Generation | ImageNet 256x256 (val) | FID | 2.33 | 293 |
| Class-conditional Image Generation | ImageNet 256x256 (test) | FID | 3.08 | 167 |
| Image Reconstruction | ImageNet 256x256 | rFID | 0.34 | 93 |
| Image Generation | ImageNet-1K 256x256 (val) | Inception Score | 271.8 | 85 |
| Image Reconstruction | ImageNet1K (val) | FID | 1.17 | 83 |
| Class-conditional Image Generation | ImageNet class-conditional 256x256 (test val) | FID | 2.33 | 75 |
| Class-conditional Image Generation | ImageNet-1k (val) | FID | 2.33 | 68 |
| Image Reconstruction | ImageNet (val) | rFID | 1.17 | 54 |

Showing 10 of 24 rows.
