VibeToken: Scaling 1D Image Tokenizers and Autoregressive Models for Dynamic Resolution Generations

About

We introduce an efficient, resolution-agnostic autoregressive (AR) image synthesis approach that generalizes to arbitrary resolutions and aspect ratios, narrowing the gap to diffusion models at scale. At its core is VibeToken, a novel resolution-agnostic 1D Transformer-based image tokenizer that encodes images into a dynamic, user-controllable sequence of 32-256 tokens, achieving a state-of-the-art trade-off between efficiency and performance. Building on VibeToken, we present VibeToken-Gen, a class-conditioned AR generator that supports arbitrary resolutions out of the box while requiring significantly less compute. Notably, VibeToken-Gen synthesizes 1024x1024 images using only 64 tokens and achieves 3.94 gFID; by comparison, a diffusion-based state-of-the-art alternative requires 1,024 tokens and attains 5.87 gFID. In contrast to fixed-resolution AR models such as LlamaGen -- whose inference FLOPs grow quadratically with resolution (11T FLOPs at 1024x1024) -- VibeToken-Gen maintains a constant 179G FLOPs (63.4x more efficient) independent of resolution. We hope VibeToken can help unlock the wide adoption of AR visual generative models in production use cases.
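To make the scaling argument concrete, here is a minimal sketch of how the token count of a conventional 2D grid tokenizer grows with image size, compared with a fixed 1D token budget. The helper `grid_token_count` and the downsampling factor of 16 are illustrative assumptions, not values taken from the paper; only the 64-token budget comes from the abstract.

```python
def grid_token_count(height: int, width: int, downsample: int = 16) -> int:
    """Tokens emitted by a 2D grid tokenizer.

    `downsample` is a hypothetical spatial downsampling factor chosen for
    illustration; it is not a figure reported in the abstract.
    """
    return (height // downsample) * (width // downsample)


# 1D token budget quoted in the abstract for 1024x1024 synthesis.
FIXED_1D_TOKENS = 64

for side in (256, 512, 1024):
    grid = grid_token_count(side, side)
    ratio = grid / FIXED_1D_TOKENS
    print(f"{side}x{side}: grid tokens = {grid}, 1D tokens = {FIXED_1D_TOKENS}, ratio = {ratio:.0f}x")
```

Under this assumption, doubling the image side quadruples the grid token count, while the 1D budget stays fixed. Since the AR generator's cost depends on sequence length, this is the intuition behind the constant-FLOPs claim.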

Maitreya Patel, Jingtao Li, Weiming Zhuang, Yezhou Yang, Lingjuan Lv • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Image Reconstruction | ImageNet 256x256 | rFID | 0.4 | 202 |
| Image Reconstruction | ImageNet-1k 256 x 256 (val) | rFID | 0.4 | 112 |
| 4x super-resolution | FFHQ 256x256 | PSNR | 24.11 | 36 |
| Class-conditional Image Generation | ImageNet-1K 256x256 | FID | 3.62 | 26 |
| Image Reconstruction | MS-COCO 256x256 2017 (val) | PSNR | 24.71 | 23 |
| Image Reconstruction | ImageNet-1k 512x512 resolution (val) | rFID | 0.51 | 18 |
| Image Reconstruction | ImageNet 512x512 | rFID | 0.51 | 12 |
| Class-conditional Image Generation | ImageNet-1k 512x512 | FID | 3.6 | 11 |
| Zero-shot Image Generation | ImageNet high resolutions 512-1024 | Quality (1:1, 512x512) | 7.62 | 5 |
| Zero-shot Image Generation | ImageNet low resolutions 256-512 | Score (1:1, 256x256) | 7.62 | 5 |
Showing 10 of 17 rows
