OpenVision 2: A Family of Generative Pretrained Visual Encoders for Multimodal Learning

About

This paper simplifies OpenVision's architecture and loss design to improve training efficiency. Following the prior vision-language pretraining works CapPa and AIMv2, as well as modern multimodal designs like LLaVA, our changes are straightforward: we remove the text encoder (and therefore the contrastive loss), retaining only the captioning loss as a purely generative training signal. We name this new version OpenVision 2. The initial results are promising: despite this simplification, OpenVision 2 remains competitive with the original model on a broad set of multimodal benchmarks while substantially cutting both training time and memory consumption. For example, with ViT-L/14, it reduces training time by about 1.5x (from 83h to 57h) and memory usage by about 1.8x (from 24.5GB to 13.8GB, equivalently allowing the maximum batch size to grow from 2k to 8k). This training efficiency also allows us to scale far beyond the largest vision encoder used in OpenVision, reaching more than 1 billion parameters. We believe this lightweight, generative-only paradigm is compelling for future vision encoder development in multimodal foundation models.

Yanqing Liu, Xianhang Li, Letian Zhang, Zirui Wang, Zeyu Zheng, Yuyin Zhou, Cihang Xie • 2025
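
The generative-only objective described in the abstract can be illustrated with a minimal sketch. It assumes the captioning loss is a standard next-token cross-entropy averaged over caption tokens; the function names and the toy vocabulary of four tokens are hypothetical, and the decoder that would turn image features into per-step logits is omitted. This is not the authors' implementation.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def captioning_loss(step_logits, target_ids):
    """Mean negative log-likelihood of the caption tokens.

    step_logits: one list of vocabulary logits per caption position
                 (assumed to come from a text decoder conditioned on image features).
    target_ids:  the ground-truth token id at each position.
    """
    if not target_ids or len(step_logits) != len(target_ids):
        raise ValueError("need one logit vector per target token")
    total = 0.0
    for logits, target in zip(step_logits, target_ids):
        total -= log_softmax(logits)[target]
    return total / len(target_ids)


# Toy example (assumed vocabulary size of 4): an uninformative decoder
# that assigns equal logits to every token at every position.
uniform_loss = captioning_loss([[0.0, 0.0, 0.0, 0.0]] * 3, [1, 2, 3])

# A decoder that strongly favors the correct token gets a much lower loss.
confident_loss = captioning_loss([[10.0, 0.0, 0.0, 0.0]], [0])
```

In this sketch the loss depends only on the caption tokens and the decoder's predictions. There is no contrastive term, so no text encoder and no batch-wide image-text similarity computation is involved, which is the simplification the abstract describes.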

Related benchmarks

Task	Dataset	Metric	Result
Visual Question Answering	ScienceQA	Accuracy	75.5	525
Image Captioning	TextCaps	CIDEr	127.4	154
Image Captioning	NoCaps	CIDEr	84.3	130
General Visual Question Answering	GQA	Accuracy	42.7	35
General Visual Question Answering	VQA v2	Accuracy	44	28
Document Understanding	AI2D	Accuracy	0.656	28
OCR-related understanding	DocVQA	Score	43.3	28
Document and OCR	InfoVQA	Accuracy Score	28.1	17
OCR	ChartQA	Score	30.7	14
General Visual Question Answering	MME-P	Rescaled Score	1.23e+3	10

Showing 10 of 18 rows
