LLaVA-OneVision: Easy Visual Task Transfer
About
We present LLaVA-OneVision, a family of open large multimodal models (LMMs) developed by consolidating our insights into data, models, and visual representations from the LLaVA-NeXT blog series. Our experimental results demonstrate that LLaVA-OneVision is the first single model that can simultaneously push the performance boundaries of open LMMs in three important computer vision scenarios: single-image, multi-image, and video. Importantly, the design of LLaVA-OneVision enables strong transfer learning across modalities and scenarios, yielding new emerging capabilities. In particular, strong video understanding and cross-scenario capabilities are demonstrated through task transfer from images to videos.
Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, Chunyuan Li • 2024
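The sketch below shows one way to run single-image inference with LLaVA-OneVision through the Hugging Face `transformers` port (v4.45+). The `llava-hf/llava-onevision-qwen2-7b-ov-hf` checkpoint id and the local image path are assumptions about that port, not details taken from this page; the model's own repository may expose a different interface.

```python
# Minimal single-image inference sketch, assuming the Hugging Face
# transformers port of LLaVA-OneVision (v4.45+). Checkpoint id and
# image path are assumptions, not taken from this page.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

model_id = "llava-hf/llava-onevision-qwen2-7b-ov-hf"  # assumed checkpoint name
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Build a chat-formatted prompt with a single image placeholder.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What is shown in this image?"},
        ],
    }
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

image = Image.open("example.jpg")  # hypothetical local image
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```

The same processor accepts lists of images (multi-image) and frame sequences (video), which is how the single checkpoint covers all three scenarios described above.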
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Question Answering | TextVQA | Accuracy | 71.1 | 1117 |
| Visual Question Answering | VizWiz | Accuracy | 60.4 | 1043 |
| Visual Question Answering | GQA | Accuracy | 62.2 | 963 |
| Object Hallucination Evaluation | POPE | Accuracy | 87.4 | 935 |
| Multimodal Evaluation | MME | Score | 2.31e+3 | 557 |
| Text-based Visual Question Answering | TextVQA | Accuracy | 84.5 | 496 |
| Multimodal Understanding | MM-Vet | MM-Vet Score | 60.6 | 418 |
| Visual Question Answering | GQA | Accuracy | 62.14 | 374 |
| Multimodal Understanding | MMBench | Accuracy | 80.8 | 367 |
| Visual Question Answering | VQA 2.0 (test-dev) | Accuracy | 85.2 | 337 |
Showing 10 of 590 benchmark rows.