LLaVA-OneVision: Easy Visual Task Transfer
About
We present LLaVA-OneVision, a family of open large multimodal models (LMMs) developed by consolidating our insights into data, models, and visual representations from the LLaVA-NeXT blog series. Our experimental results demonstrate that LLaVA-OneVision is the first single model that can simultaneously push the performance boundaries of open LMMs in three important computer vision scenarios: single-image, multi-image, and video. Importantly, the design of LLaVA-OneVision enables strong transfer learning across modalities and scenarios, yielding new emerging capabilities. In particular, strong video understanding and cross-scenario capabilities are demonstrated through task transfer from images to videos.
Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, Chunyuan Li • 2024
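The sketch below shows one way to run single-image inference with LLaVA-OneVision through the Hugging Face `transformers` port (v4.45+). The `llava-hf/llava-onevision-qwen2-7b-ov-hf` checkpoint id and the local image path are assumptions about that port, not details taken from this page; the model's own repository may expose a different interface.

```python
# Minimal single-image inference sketch, assuming the Hugging Face
# transformers port of LLaVA-OneVision (v4.45+). Checkpoint id and
# image path are assumptions, not taken from this page.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

model_id = "llava-hf/llava-onevision-qwen2-7b-ov-hf"  # assumed checkpoint name
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Build a chat-formatted prompt with a single image placeholder.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What is shown in this image?"},
        ],
    }
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

image = Image.open("example.jpg")  # hypothetical local image
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```

The same processor accepts lists of images (multi-image) and frame sequences (video), which is how the single checkpoint covers all three scenarios described above.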
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Question Answering | TextVQA | Accuracy | 71.1 | 1117 |
| Visual Question Answering | VizWiz | Accuracy | 60.4 | 1043 |
| Visual Question Answering | GQA | Accuracy | 62.2 | 963 |
| Object Hallucination Evaluation | POPE | Accuracy | 87.4 | 935 |
| Multimodal Evaluation | MME | Score | 2.31e+3 | 557 |
| Text-based Visual Question Answering | TextVQA | Accuracy | 84.5 | 496 |
| Multimodal Understanding | MM-Vet | MM-Vet Score | 60.6 | 418 |
| Visual Question Answering | GQA | Accuracy | 62.14 | 374 |
| Multimodal Understanding | MMBench | Accuracy | 80.8 | 367 |
| Visual Question Answering | VQA 2.0 (test-dev) | Accuracy | 85.2 | 337 |
Showing 10 of 590 benchmark rows.