OpenVision 2: A Family of Generative Pretrained Visual Encoders for Multimodal Learning

About

This paper simplifies OpenVision's architecture and loss design to improve its training efficiency. Following prior vision-language pretraining works such as CapPa and AIMv2, as well as modern multimodal designs like LLaVA, our changes are straightforward: we remove the text encoder (and therefore the contrastive loss), retaining only the captioning loss as a purely generative training signal. We name this new version OpenVision 2. The initial results are promising: despite this simplification, OpenVision 2 performs on par with the original model on a broad set of multimodal benchmarks while substantially cutting both training time and memory consumption. For example, with ViT-L/14, it reduces training time by about 1.5x (from 83h to 57h) and memory usage by about 1.8x (from 24.5GB to 13.8GB, equivalently allowing the maximum batch size to grow from 2k to 8k). This training efficiency also allows us to scale far beyond the largest vision encoder used in OpenVision, reaching more than 1 billion parameters. We strongly believe that this lightweight, generative-only paradigm is compelling for future vision encoder development in multimodal foundation models.
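
As a quick sanity check on the efficiency figures quoted above, the following minimal sketch recomputes the reduction factors from the raw numbers. The helper name `reduction_factor` is hypothetical, and reading "2k" and "8k" as 2,000 and 8,000 is an assumption; the input values themselves are taken from the abstract.

```python
def reduction_factor(before: float, after: float) -> float:
    """Return how many times smaller `after` is than `before`."""
    if after <= 0:
        raise ValueError("after must be positive")
    return before / after


# Figures quoted in the abstract for ViT-L/14.
train_hours = (83.0, 57.0)   # OpenVision -> OpenVision 2
memory_gb = (24.5, 13.8)     # OpenVision -> OpenVision 2
max_batch = (2_000, 8_000)   # assumption: "2k" and "8k" mean 2,000 and 8,000

time_factor = reduction_factor(*train_hours)
memory_factor = reduction_factor(*memory_gb)
batch_growth = max_batch[1] / max_batch[0]

print(f"training time reduced by {time_factor:.2f}x")   # 1.46x
print(f"memory usage reduced by {memory_factor:.2f}x")  # 1.78x
print(f"maximum batch size grows {batch_growth:.0f}x")  # 4x
```

The computed ratios (about 1.46x and 1.78x) are consistent with the rounded "about 1.5x" and "about 1.8x" stated in the abstract.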

Yanqing Liu, Xianhang Li, Letian Zhang, Zirui Wang, Zeyu Zheng, Yuyin Zhou, Cihang Xie • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Question Answering | ScienceQA | Accuracy | 75.5 | 446 |
| Image Captioning | TextCaps | CIDEr | 127.4 | 112 |
| Image Captioning | NoCaps | CIDEr | 84.3 | 111 |
| Document Understanding | AI2D | Accuracy | 0.656 | 28 |
| OCR-related understanding | DocVQA | Score | 43.3 | 28 |
| Document and OCR | InfoVQA | Accuracy Score | 28.1 | 17 |
| OCR | ChartQA | Score | 30.7 | 14 |
| General Visual Question Answering | MME-P | Rescaled Score | 1.23e+3 | 10 |
| General Visual Question Answering | GQA | Accuracy | 42.7 | 10 |
| Holistic Multimodal Understanding | 14 Benchmarks Composite | Average Score | 58.7 | 10 |
Showing 10 of 18 rows
