LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training

About

We present LLaVA-OneVision-1.5, a novel family of Large Multimodal Models (LMMs) that achieve state-of-the-art performance with significantly reduced computational and financial costs. Unlike existing work, LLaVA-OneVision-1.5 provides an open, efficient, and reproducible framework for building high-quality vision-language models entirely from scratch. The LLaVA-OneVision-1.5 release comprises four primary components: (1) Large-Scale Curated Datasets: We construct an 85M concept-balanced pretraining dataset, LLaVA-OneVision-1.5-Mid-Training, and a meticulously curated 22M instruction dataset, LLaVA-OneVision-1.5-Instruct. (2) Efficient Training Framework: We develop a complete end-to-end training framework that uses an offline parallel data packing strategy to train LLaVA-OneVision-1.5 within a $16,000 budget. (3) State-of-the-art Performance: Experimental results show that LLaVA-OneVision-1.5 is highly competitive across a broad range of downstream tasks. Specifically, LLaVA-OneVision-1.5-8B outperforms Qwen2.5-VL-7B on 18 of 27 benchmarks, and LLaVA-OneVision-1.5-4B surpasses Qwen2.5-VL-3B on all 27 benchmarks. (4) RL-based Post-training: We unlock the model's latent potential through a lightweight RL stage that elicits robust chain-of-thought reasoning and significantly boosts performance on complex multimodal reasoning tasks.
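
The abstract names an offline data packing strategy but does not describe it. As a rough illustration of the general idea only, the sketch below groups variable-length samples into fixed-budget sequences before training so that fewer tokens are spent on padding. It uses generic first-fit-decreasing bin packing, which is an assumption and not necessarily the authors' method; the names `pack_offline`, `padding_ratio`, and `max_len` are hypothetical, and the sketch runs in a single process rather than in parallel.

```python
from typing import List, Sequence


def pack_offline(lengths: Sequence[int], max_len: int) -> List[List[int]]:
    """Group sample indices into bins whose total token count is <= max_len.

    Generic first-fit-decreasing packing (an illustrative assumption,
    not the algorithm used by LLaVA-OneVision-1.5).
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    # Visit the longest samples first so short ones fill the leftover space.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    bins: List[List[int]] = []  # sample indices in each packed sequence
    free: List[int] = []        # remaining token budget of each packed sequence
    for i in order:
        n = lengths[i]
        if n > max_len:
            raise ValueError(f"sample {i} has {n} tokens, exceeds max_len={max_len}")
        for b, room in enumerate(free):
            if n <= room:
                # First existing bin with enough room takes the sample.
                bins[b].append(i)
                free[b] -= n
                break
        else:
            # No bin had room: open a new packed sequence.
            bins.append([i])
            free.append(max_len - n)
    return bins


def padding_ratio(lengths: Sequence[int], bins: List[List[int]], max_len: int) -> float:
    """Fraction of the packed token budget that would be padding."""
    if not bins:
        return 0.0
    used = sum(lengths[i] for b in bins for i in b)
    return 1.0 - used / (len(bins) * max_len)


if __name__ == "__main__":
    token_lengths = [700, 300, 500, 500, 200, 800]  # made-up sample lengths
    packed = pack_offline(token_lengths, max_len=1000)
    print(packed, padding_ratio(token_lengths, packed, 1000))
```

Because the packing is computed once ahead of training, the resulting index groups can be stored and reused across epochs; a parallel variant could shard the sample list and pack each shard independently, at the cost of slightly less tight bins.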

Xiang An, Yin Xie, Kaicheng Yang, Wenkang Zhang, Xiuwei Zhao, Zheng Cheng, Yirui Wang, Songcen Xu, Changrui Chen, Didi Zhu, Chunsheng Wu, Huajie Tan, Chunyuan Li, Jing Yang, Jie Yu, Xiyao Wang, Bin Qin, Yumeng Wang, Zizhen Yan, Ziyong Feng, Ziwei Liu, Bo Li, Jiankang Deng • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Multi-discipline Multimodal Understanding | MMMU | Accuracy 55.4 | 266 |
| Visual Question Answering | ChartQA | Accuracy 86.64 | 239 |
| Chart Question Answering | ChartQA | Accuracy 87.1 | 229 |
| Multimodal Understanding | SEED-Bench | -- | 203 |
| Multimodal Understanding | MMStar | Accuracy 67.7 | 197 |
| Diagram Question Answering | AI2D | AI2D Accuracy 84.2 | 196 |
| Visual Mathematical Reasoning | MathVista | Accuracy 69.6 | 189 |
| Multimodal Understanding | SEED | Accuracy 77.3 | 136 |
| Multimodal Reasoning | MMMU (val) | Accuracy 55.44 | 114 |
| Multimodal Understanding | SEED-2-Plus | Accuracy 69.2 | 99 |
Showing 10 of 84 rows