VibeToken: Scaling 1D Image Tokenizers and Autoregressive Models for Dynamic Resolution Generations
About
We introduce an efficient, resolution-agnostic autoregressive (AR) image synthesis approach that generalizes to arbitrary resolutions and aspect ratios, narrowing the gap to diffusion models at scale. At its core is VibeToken, a novel resolution-agnostic 1D Transformer-based image tokenizer that encodes images into a dynamic, user-controllable sequence of 32-256 tokens, achieving a state-of-the-art trade-off between efficiency and performance. Building on VibeToken, we present VibeToken-Gen, a class-conditioned AR generator that supports arbitrary resolutions out of the box while requiring significantly fewer compute resources. Notably, VibeToken-Gen synthesizes 1024x1024 images using only 64 tokens and achieves 3.94 gFID; by comparison, a diffusion-based state-of-the-art alternative requires 1,024 tokens and attains 5.87 gFID. In contrast to fixed-resolution AR models such as LlamaGen -- whose inference FLOPs grow quadratically with resolution (11T FLOPs at 1024x1024) -- VibeToken-Gen maintains a constant 179G FLOPs (63.4x more efficient) independent of resolution. We hope VibeToken can help unlock the wide adoption of AR visual generative models in production use cases.
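To make the token and compute figures in the abstract concrete, the following minimal sketch contrasts a hypothetical 2D grid tokenizer, whose token count grows with image area, against a fixed 1D token budget. The function `grid_token_count` and the downsampling factor of 32 are illustrative assumptions (the factor is chosen only because it reproduces the 1,024 tokens quoted for 1024x1024); neither is taken from the VibeToken or LlamaGen implementations.

```python
def grid_token_count(height: int, width: int, downsample: int) -> int:
    """Token count of a hypothetical 2D grid tokenizer (one token per patch)."""
    return (height // downsample) * (width // downsample)


# Fixed 1D budget quoted in the abstract for 1024x1024 generation.
FIXED_1D_BUDGET = 64

# Assumed downsampling factor: 32 yields 1,024 tokens at 1024x1024.
ASSUMED_DOWNSAMPLE = 32

for side in (256, 512, 1024):
    grid_tokens = grid_token_count(side, side, ASSUMED_DOWNSAMPLE)
    print(f"{side}x{side}: grid tokenizer {grid_tokens} tokens, "
          f"fixed 1D budget {FIXED_1D_BUDGET} tokens")

# Ratio of the rounded FLOP figures quoted in the abstract.
flops_ratio = 11e12 / 179e9
print(f"FLOPs ratio from rounded figures: {flops_ratio:.1f}x")
```

Under these assumptions the grid token count quadruples each time the side length doubles, while the 1D budget stays fixed. The rounded figures (11T and 179G) give a ratio of roughly 61x; the 63.4x stated in the abstract presumably comes from unrounded FLOP counts.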
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Image Reconstruction | ImageNet 256x256 | rFID 0.4 | 202 |
| Image Reconstruction | ImageNet-1k 256 x 256 (val) | rFID 0.4 | 112 |
| 4x super-resolution | FFHQ 256x256 | PSNR 24.11 | 36 |
| Class-conditional Image Generation | ImageNet-1K 256x256 | FID 3.62 | 26 |
| Image Reconstruction | MS-COCO 256x256 2017 (val) | PSNR 24.71 | 23 |
| Image Reconstruction | ImageNet-1k 512x512 resolution (val) | rFID 0.51 | 18 |
| Image Reconstruction | ImageNet 512x512 | rFID 0.51 | 12 |
| Class-conditional Image Generation | ImageNet-1k 512x512 | FID 3.6 | 11 |
| Zero-shot Image Generation | ImageNet high resolutions 512-1024 | Quality (1:1, 512x512) 7.62 | 5 |
| Zero-shot Image Generation | ImageNet low resolutions 256-512 | Score (1:1, 256x256) 7.62 | 5 |