MaskBit: Embedding-free Image Generation via Bit Tokens
About
Masked transformer models for class-conditional image generation have become a compelling alternative to diffusion models. These frameworks typically comprise two stages: an initial VQGAN model that maps between latent space and image space, and a subsequent Transformer model that generates images within latent space. In this study, we present two primary contributions. First, an empirical and systematic examination of VQGANs, leading to a modernized VQGAN. Second, a novel embedding-free generation network operating directly on bit tokens, a binary quantized representation of tokens with rich semantics. The first contribution furnishes a transparent, reproducible, and high-performing VQGAN model that improves accessibility and matches the performance of current state-of-the-art methods while revealing previously undisclosed details. The second contribution demonstrates that embedding-free image generation using bit tokens achieves a new state-of-the-art FID of 1.52 on the ImageNet 256x256 benchmark with a compact generator of only 305M parameters. The code for this project is available at https://github.com/markweberdev/maskbit.
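To make the idea of a bit token concrete, here is a minimal sketch in plain Python. The abstract only says that bit tokens are a binary quantized representation of tokens, so the sign-based quantization, the bit width of 4, and all function names below are illustrative assumptions and are not taken from the MaskBit code base.

```python
# Illustrative sketch only: sign-based binarization is an ASSUMPTION here,
# not a description of the MaskBit implementation.

def to_bit_token(latent):
    """Quantize each latent channel to one bit using its sign (assumed rule)."""
    return [1 if value > 0 else 0 for value in latent]


def bits_to_index(bits):
    """Pack a bit token into the integer index a codebook lookup would use."""
    index = 0
    for bit in bits:
        index = (index << 1) | bit
    return index


def index_to_bits(index, num_bits):
    """Unpack an integer index back into its bit token (most significant bit first)."""
    return [(index >> shift) & 1 for shift in range(num_bits - 1, -1, -1)]


def bits_to_signs(bits):
    """Map bits to +1/-1 values that a network could consume directly,
    without an embedding-table lookup (hypothetical input format)."""
    return [1.0 if bit else -1.0 for bit in bits]


latent = [0.7, -1.2, 0.1, -0.3]   # hypothetical 4-channel latent vector
bits = to_bit_token(latent)       # [1, 0, 1, 0]
index = bits_to_index(bits)       # 10
signs = bits_to_signs(bits)       # [1.0, -1.0, 1.0, -1.0]
```

In this sketch a token with K bits corresponds to one of 2^K implicit codebook entries, and the bits themselves can serve as the generator's input, which is one way to read "embedding-free".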
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Class-conditional Image Generation | ImageNet 256x256 | Inception Score (IS) | 341.8 | 441 |
| Image Generation | ImageNet 256x256 (val) | FID | 1.52 | 307 |
| Class-conditional Image Generation | ImageNet 256x256 (train) | IS | 328.6 | 305 |
| Image Reconstruction | ImageNet 256x256 | rFID | 1.51 | 93 |
| Image Generation | ImageNet-1K 256x256 (val) | Inception Score | 328.6 | 85 |
| Image Generation | ImageNet | FID | 1.52 | 68 |
| Image Generation | ImageNet 256x256 (test) | FID | 1.52 | 46 |
| Image Generation | ImageNet 256x256 (test val) | FID | 6.18 | 35 |
| Class-conditional Image Generation | ImageNet 256x256 2012 (train val) | -- | -- | 30 |
| Image Reconstruction | COCO 2014 (val) | rFID | 8.3 | 3 |