
LLaVA-OneVision: Easy Visual Task Transfer

About

We present LLaVA-OneVision, a family of open large multimodal models (LMMs) developed by consolidating our insights into data, models, and visual representations in the LLaVA-NeXT blog series. Our experimental results demonstrate that LLaVA-OneVision is the first single model that can simultaneously push the performance boundaries of open LMMs in three important computer vision scenarios: single-image, multi-image, and video. Importantly, the design of LLaVA-OneVision allows strong transfer learning across different modalities and scenarios, yielding emerging capabilities. In particular, strong video understanding and cross-scenario capabilities are demonstrated through task transfer from images to videos.

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, Chunyuan Li • 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Visual Question Answering | VizWiz | Accuracy | 60.4 | 1525 |
| Object Hallucination Evaluation | POPE | Accuracy | 88.4 | 1455 |
| Visual Question Answering | TextVQA | Accuracy | 71.1 | 1285 |
| Visual Question Answering | GQA | Accuracy | 62.2 | 1249 |
| Text-based Visual Question Answering | TextVQA | Accuracy | 84.5 | 807 |
| Multimodal Evaluation | MME | Score | 2.31e+3 | 658 |
| Multimodal Understanding | MMBench | Accuracy | 80.8 | 637 |
| Human-Object Interaction Detection | HICO-DET (test) | mAP (full) | 4.25 | 544 |
| Multimodal Understanding | MM-Vet | MM-Vet Score | 60.6 | 531 |
| Visual Question Answering | GQA | Accuracy | 62.14 | 505 |
Showing 10 of 969 rows
...
