QLIP: Text-Aligned Visual Tokenization Unifies Auto-Regressive Multimodal Understanding and Generation

About

We introduce Quantized Language-Image Pretraining (QLIP), a visual tokenization method that combines state-of-the-art reconstruction quality with state-of-the-art zero-shot image understanding. QLIP trains a binary-spherical-quantization-based autoencoder with reconstruction and language-image alignment objectives. We are the first to show that the two objectives do not need to be at odds. We balance the two loss terms dynamically during training and show that a two-stage training pipeline effectively mixes the large-batch requirements of image-language pre-training with the memory bottleneck imposed by the reconstruction objective. We validate the effectiveness of QLIP for multimodal understanding and text-conditioned image generation with a single model. Specifically, QLIP serves as a drop-in replacement for the visual encoder for LLaVA and the image tokenizer for LlamaGen with comparable or even better performance. Finally, we demonstrate that QLIP enables a unified mixed-modality auto-regressive model for understanding and generation.

Yue Zhao, Fuzhao Xue, Scott Reed, Linxi Fan, Yuke Zhu, Jan Kautz, Zhiding Yu, Philipp Kr\"ahenb\"uhl, De-An Huang• 2025

Related benchmarks

Task	Dataset	Result
Object Hallucination Evaluation	POPE	--	2019
Visual Question Answering	GQA	Accuracy61.8	1425
Multi-discipline Multimodal Understanding	MMMU	--	363
Multimodal Understanding	MME	--	207
Visual Question Answering	GQA	Score61.8	193
Visual Understanding	MM-Vet	MM-Vet Score33.3	167
Text-to-Image Generation	DPG-Bench	DPG Score78.17	156
Image Reconstruction	ImageNet1K (val)	FID1.46	124
Image Reconstruction	ImageNet-1k 256 x 256 (val)	rFID1.46	112
Text-to-Image Generation	GenEval	GenEval Score0.48	108

Showing 10 of 18 rows

Other info

Follow for update

@wizwand_team Discord