
MergeTok: Unified Continuous and Discrete Visual Tokenization via Token Merging

About

Most visual tokenizers for image generation are bifurcated into two families with complementary limitations: continuous VAEs offer high-fidelity reconstruction but suffer from dense, entangled latents that are poorly suited for semantic control, whereas discrete VQ-based models enable autoregressive generation yet struggle with gradient sparsity, unstable training, and codebook collapse. In this work, we introduce MergeTok, a unified tokenizer that jointly optimizes continuous (VAE) and discrete (VQ) tokenizers within an encoder-decoder architecture, leveraging token merging techniques as a semantic bridge. By clustering similar tokens during encoding, MergeTok establishes a structural prior that provides dual supervision signals: (i) it imposes merged-token semantic alignment in the VAE branch, regularizing its latent space toward disentangled, semantic-aware representations; (ii) it derives group-wise constraints, promoting intra-group diversity and inter-group exclusivity that stabilize VQ training. MergeTok shows competitive reconstruction and generation performance on ImageNet-256, with substantially lower rFID than strong VAE and VQ models under matched token budgets, while producing semantically organized token representations compatible with both autoregressive and diffusion generators. This shows that a single architecture can endow visual tokenizers with robust semantic organization and generator-friendly discreteness.
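
The abstract does not specify how tokens are clustered during encoding. As a rough illustration of the general idea of token merging, the sketch below uses a bipartite, cosine-similarity matching in the style of ToMe: tokens are split into two alternating sets, each token in the first set is paired with its most similar token in the second, and the `r` most similar pairs are averaged. The function name `merge_tokens`, the alternating split, and the plain averaging are assumptions made for illustration, not MergeTok's actual procedure.

```python
import numpy as np


def merge_tokens(tokens, r):
    """Illustrative ToMe-style merge (an assumption, not MergeTok's method).

    tokens: (n, d) array with n >= 2.
    r: number of tokens to merge away.
    Returns (merged, groups): merged has n - r rows, and groups[i] is the
    row of `merged` that input token i was assigned to.
    """
    tokens = np.asarray(tokens, dtype=float)
    n = tokens.shape[0]
    a_idx = np.arange(0, n, 2)  # set A: even positions
    b_idx = np.arange(1, n, 2)  # set B: odd positions
    r = min(r, len(a_idx))

    # Cosine similarity between every A token and every B token.
    normed = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-8)
    sim = normed[a_idx] @ normed[b_idx].T
    best_b = sim.argmax(axis=1)  # closest B token for each A token
    best_sim = sim.max(axis=1)

    # The r A tokens with the highest similarity are merged into B.
    order = np.argsort(-best_sim)
    merged_a = order[:r]
    kept_a = np.sort(order[r:])

    # Average each B token with the A tokens merged into it.
    sums = tokens[b_idx].copy()
    counts = np.ones(len(b_idx))
    np.add.at(sums, best_b[merged_a], tokens[a_idx[merged_a]])
    np.add.at(counts, best_b[merged_a], 1)
    merged_b = sums / counts[:, None]

    merged = np.concatenate([tokens[a_idx[kept_a]], merged_b], axis=0)

    # Record which output row each input token ended up in.
    groups = np.empty(n, dtype=int)
    groups[a_idx[kept_a]] = np.arange(len(kept_a))
    groups[b_idx] = len(kept_a) + np.arange(len(b_idx))
    groups[a_idx[merged_a]] = len(kept_a) + best_b[merged_a]
    return merged, groups
```

The returned `groups` array is the kind of grouping that group-wise constraints could be built on: tokens sharing a group index were judged similar, and tokens in different groups were not. How MergeTok turns such groups into the VAE alignment and VQ constraints is not described in the abstract.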

Luyuan Zhang, Siyuan Li, Zedong Wang, Qingsong Xie, Cheng Tan, Anna Wang, Yanhao Zhang, Chen Chen, Haonan Lu, Haoqian Wang • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Class-conditional Image Generation | ImageNet 256x256 | Inception Score (IS) 311.7 | 967 |
| Image Reconstruction | ImageNet 256x256 | rFID 0.47 | 202 |
| Class-conditional Image Generation | ImageNet 512x512 (val) | -- | 102 |
| Image Reconstruction | MS-COCO 2017 (val) | rFID 1.8 | 33 |
| Image Reconstruction | ImageNet-1K 256x256 | rFID 0.48 | 31 |
| Image Reconstruction | ImageNet-1k 512x512 resolution (val) | rFID 0.42 | 18 |
| Image Representation | ImageNet-1K 256x256 | Linear Accuracy 78.3 | 15 |
| Image Reconstruction | ImageNet 1024x1024 (val) | rFID 1.87 | 6 |
