Let ViT Speak: Generative Language-Image Pre-training

About

In this paper, we present Generative Language-Image Pre-training (GenLIP), a minimalist generative pretraining framework for Vision Transformers (ViTs) designed for multimodal large language models (MLLMs). To better align vision encoders with the autoregressive nature of LLMs, GenLIP trains a ViT to predict language tokens directly from visual tokens using a standard language modeling objective, without contrastive batch construction or an additional text decoder. This design offers three key advantages: (1) Simplicity: a single transformer jointly models visual and textual tokens; (2) Scalability: it scales effectively with both data and model size; and (3) Performance: it achieves competitive or superior results across diverse multimodal benchmarks. Trained on 8B samples from Recap-DataComp-1B, GenLIP matches or surpasses strong baselines despite using substantially less pretraining data. After continued pretraining on multi-resolution images at native aspect ratios, GenLIP further improves on detail-sensitive tasks such as OCR and chart understanding, making it a strong foundation for vision encoders in MLLMs.
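
The objective described above (predicting language tokens from visual tokens with a standard language modeling loss) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the joint sequence layout [visual tokens; text tokens], and the choice to average the loss over text positions only are assumptions made for illustration, and the attention masking used by GenLIP is not specified here.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one list of vocabulary scores."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def text_only_lm_loss(logits, text_ids, num_visual):
    """Mean next-token cross-entropy over text targets only (hypothetical helper).

    logits:     one list of vocabulary scores per position of the joint
                sequence, assumed to be laid out as [visual tokens; text tokens].
    text_ids:   the text token ids that follow the visual tokens.
    num_visual: number of visual tokens at the start of the sequence (>= 1).
    """
    if num_visual < 1 or not text_ids:
        raise ValueError("need at least one visual token and one text token")
    if len(logits) != num_visual + len(text_ids):
        raise ValueError("logits must cover the whole joint sequence")

    total = 0.0
    for i, target in enumerate(text_ids):
        # The output at position p predicts the token at position p + 1, so the
        # first text token is predicted from the last visual position.
        pos = num_visual - 1 + i
        total -= log_softmax(logits[pos])[target]
    return total / len(text_ids)


# Toy example: 3 visual tokens, 2 text tokens, vocabulary of 4, uniform scores.
vocab_size = 4
toy_logits = [[0.0] * vocab_size for _ in range(3 + 2)]
toy_loss = text_only_lm_loss(toy_logits, text_ids=[1, 2], num_visual=3)
```

With uniform scores every token has probability 1/4, so `toy_loss` equals `math.log(4)`. In a real model the scores would come from the transformer's output head rather than a constant list, and no loss is taken at positions whose next token is visual.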

Yan Fang, Mengcheng Lan, Zilong Huang, Weixian Lei, Yunqing Zhao, Yujie Zhong, Yingchen Yu, Qi She, Yao Zhao, Yunchao Wei • 2026

Related benchmarks

Task	Dataset	Metric	Result
Semantic segmentation	ADE20K (val)	mIoU	44.5	3089
Visual Question Answering	ScienceQA	Accuracy	77.5	525
Image Captioning	TextCaps	CIDEr	135.4	154
Image Captioning	NoCaps	CIDEr	88.3	130
General Visual Question Answering	GQA	Accuracy	45.5	35
OCR-related understanding	DocVQA	Score	57	28
Document Understanding	AI2D	Accuracy	0.689	28
General Visual Question Answering	VQA v2	Accuracy	49.1	28
Document and OCR	InfoVQA	Accuracy Score	33.9	17
OCR	ChartQA	Score	45	14

Showing 10 of 21 rows
