Let ViT Speak: Generative Language-Image Pre-training

About

In this paper, we present Generative Language-Image Pre-training (GenLIP), a minimalist generative pretraining framework for Vision Transformers (ViTs) designed for multimodal large language models (MLLMs). To better align vision encoders with the autoregressive nature of LLMs, GenLIP trains a ViT to predict language tokens directly from visual tokens using a standard language modeling objective, without contrastive batch construction or an additional text decoder. This design offers three key advantages: (1) Simplicity: a single transformer jointly models visual and textual tokens; (2) Scalability: it scales effectively with both data and model size; and (3) Performance: it achieves competitive or superior results across diverse multimodal benchmarks. Trained on 8B samples from Recap-DataComp-1B, GenLIP matches or surpasses strong baselines despite using substantially less pretraining data. After continued pretraining on multi-resolution images at native aspect ratios, GenLIP further improves on detail-sensitive tasks such as OCR and chart understanding, making it a strong foundation for vision encoders in MLLMs.

Yan Fang, Mengcheng Lan, Zilong Huang, Weixian Lei, Yunqing Zhao, Yujie Zhong, Yingchen Yu, Qi She, Yao Zhao, Yunchao Wei • 2026
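
The abstract describes predicting language tokens from visual tokens with a standard language modeling objective. As a rough illustration only, the sketch below computes a next-token cross-entropy over the text positions of a joint sequence. The sequence layout (visual tokens first, then text), the function name `text_only_lm_loss`, and the choice to score only text targets are assumptions made for this example; they are not taken from the paper, and random numbers stand in for a real model's outputs.

```python
import numpy as np


def text_only_lm_loss(logits, text_ids, num_visual):
    """Mean next-token cross-entropy over the text positions of a joint sequence.

    Assumed layout (illustrative, not taken from the paper):
    [visual_0 .. visual_{V-1}, text_0 .. text_{T-1}].
    logits[i] is treated as the prediction for the token at position i + 1.
    """
    logits = np.asarray(logits, dtype=np.float64)
    text_ids = np.asarray(text_ids, dtype=np.int64)
    num_text = text_ids.shape[0]
    if num_visual < 1 or num_text < 1:
        raise ValueError("need at least one visual token and one text token")
    if logits.shape[0] != num_visual + num_text:
        raise ValueError("expected one row of logits per sequence position")

    # Rows V-1 .. V+T-2 are the ones that predict text_0 .. text_{T-1}.
    pred = logits[num_visual - 1 : num_visual + num_text - 1]

    # Numerically stable log-softmax over the vocabulary axis.
    shifted = pred - pred.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

    # Negative log-likelihood of the actual text tokens; visual targets are not scored.
    return float(-log_probs[np.arange(num_text), text_ids].mean())


# Toy usage: random logits stand in for the output of a real model.
rng = np.random.default_rng(0)
V, T, vocab = 4, 3, 8
toy_logits = rng.normal(size=(V + T, vocab))
toy_text = np.array([1, 5, 2])
print(text_only_lm_loss(toy_logits, toy_text, V))
```

In this sketch the first text token is predicted from the last visual position, so the loss depends on the image content even though no visual token is ever a prediction target. How the actual model masks attention or weights positions is not stated in the abstract and is left out here.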

Related benchmarks

Task                      | Dataset        | Result              | Rank
Semantic segmentation     | ADE20K (val)   | mIoU 44.5           | 3069
Visual Question Answering | ScienceQA      | Accuracy 77.5       | 446
Image Captioning          | TextCaps       | CIDEr 135.4         | 112
Image Captioning          | NoCaps         | CIDEr 88.3          | 111
OCR-related understanding | DocVQA         | Score 57            | 28
Document Understanding    | AI2D           | Accuracy 0.689      | 28
Document and OCR          | InfoVQA        | Accuracy Score 33.9 | 17
OCR                       | ChartQA        | Score 45            | 14
Document and OCR          | OCR-B OCRBench | Accuracy Score 55.6 | 10
Document and OCR          | TextVQA        | Accuracy Score 59   | 10
Showing 10 of 21 rows
