SemHiTok: A Unified Image Tokenizer via Semantic-Guided Hierarchical Codebook for Multimodal Understanding and Generation

About

In this paper, we introduce SemHiTok, a unified image Tokenizer via Semantic-Guided Hierarchical codebook that provides consistent discrete representations for multimodal understanding and generation. Unified image tokenizers, which are designed to capture high-level semantic features for understanding while retaining low-level pixel features for generation, have recently attracted growing interest in the research community. Previous works attempt to train a unified image tokenizer by combining losses for semantic distillation and pixel reconstruction. However, because multimodal understanding and generation prioritize different levels of features, joint training methods struggle to achieve a good trade-off between the two. SemHiTok addresses this challenge through a novel semantic-guided hierarchical codebook, which builds pixel sub-codebooks on top of a pretrained semantic codebook. This design decouples semantic and pixel features in both structure and training strategy, enabling the tokenizer to capture pixel features while retaining its ability to comprehend high-level semantic information. Our experiments demonstrate that SemHiTok achieves leading performance in image reconstruction and multimodal understanding under the LLaVA-v1.5 setting. Further, we develop a unified MLLM with SemHiTok, which exhibits superior performance across multimodal understanding and generation tasks. Extensive experiments confirm our analysis, showing that our unified image tokenizer architecture achieves a better trade-off.
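The two-level lookup described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes plain nearest-neighbor quantization with squared L2 distance, toy hand-written codebooks, and one pixel sub-codebook per semantic code. The names `nearest`, `hierarchical_quantize`, and `flat_index`, as well as the flattening of the code pair into a single token id, are hypothetical and chosen only for illustration.

```python
# Illustrative sketch only; all names and the distance metric are assumptions,
# not taken from the SemHiTok paper or its code.

def nearest(vec, codebook):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    best_idx, best_dist = 0, float("inf")
    for i, entry in enumerate(codebook):
        dist = sum((a - b) ** 2 for a, b in zip(vec, entry))
        if dist < best_dist:
            best_idx, best_dist = i, dist
    return best_idx

def hierarchical_quantize(sem_feat, pix_feat, sem_codebook, pix_subcodebooks):
    """Quantize in two stages: semantic code first, then a pixel code
    drawn from the sub-codebook attached to that semantic code."""
    k = nearest(sem_feat, sem_codebook)         # semantic code index
    j = nearest(pix_feat, pix_subcodebooks[k])  # pixel code within sub-codebook k
    return k, j

def flat_index(k, j, sub_size):
    """Hypothetical flattening of the (semantic, pixel) pair into one token id,
    assuming every sub-codebook has the same size."""
    return k * sub_size + j

# Toy codebooks: 2 semantic codes, each with a 2-entry pixel sub-codebook.
sem_codebook = [[0.0, 0.0], [1.0, 1.0]]
pix_subcodebooks = [
    [[0.0], [0.5]],  # pixel sub-codebook for semantic code 0
    [[1.0], [2.0]],  # pixel sub-codebook for semantic code 1
]

k, j = hierarchical_quantize([0.9, 0.8], [1.8], sem_codebook, pix_subcodebooks)
token_id = flat_index(k, j, sub_size=2)
```

In this sketch the semantic codebook is never modified by the pixel stage, which mirrors the decoupling the abstract describes: the pixel sub-codebooks can be fitted separately while the pretrained semantic codes stay fixed.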

Zisheng Chen, Chunwei Wang, Runhui Huang, Hongbin Xu, Xiuwei Chen, Jun Zhou, Jianhua Han, Hang Xu, Xiaodan Liang • 2025

Related benchmarks

Task	Dataset	Result
Object Hallucination Evaluation	POPE	--	2056
Visual Question Answering	GQA	Accuracy 60.3	1445
Multimodal Evaluation	MME	--	902
Multimodal Understanding	MMBench	--	887
Multimodal Understanding	MM-Vet	MM-Vet Score 36.6	664
Multimodal Understanding	SEED-Bench	--	571
Multimodal Understanding	MME	MME Score 1.99e+3	207
Visual Question Answering	GQA	Score 61.7	193
Text-to-Image Generation	DPG-Bench	DPG Score 83.59	156
Visual Question Answering	GQA	GQA Score 60.3	152

Showing 10 of 24 rows
