| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Multi-modal Understanding | LLaVA-Bench Wild | LLaVA^W Score | 91.2 | 52 |
| Multimodal Conversation | LLaVA-Bench Wild | Score | 102 | 52 |
| Visual Question Answering | LLaVA-Bench-In-The-Wild | Score | 87.1 | 38 |
| Multimodal Evaluation | LLaVA-Bench | LLaVA-Bench Score | 79.2 | 38 |
| Multimodal Evaluation | LLaVA-bench in-the-wild | Score | 97.3 | 36 |
| Visual Instruction Following | LLaVA-Bench Wild | Score | 81.8 | 35 |
| Multimodal Evaluation | LLaVA-Bench-Wild (LLaVA-W) | Overall Score | 97.3 | 24 |
| Multimodal Understanding | LLaVA-Bench | Overall Score | 91.9 | 23 |
| Multimodal Instruction Following | LLaVA-Bench In-the-Wild | Score | 93.1 | 23 |
| Multimodal Conversation | LLaVA Bench | LLaVA Bench Score | 93.1 | 21 |
| Multimodal Dialogue Evaluation | LLaVA-Bench Wild (test) | Score | 97.7 | 19 |
| Multimodal Reasoning | LLaVA-Bench Wild | GPT-4 Score | 74.5 | 19 |
| Visual Instruction Following Evaluation | LLaVA-Bench | Accuracy | 4.38 | 18 |
| Utility Evaluation | LLaVA-Bench Coco | Score | 92.3 | 13 |
| Visual Question Answering | LLaVA Bench | VQA ASR | 68.31 | 12 |
| General Multimodal Evaluation | LLaVA-Bench Wild | Relative Score | 92.8 | 12 |
| Multimodal Performance Evaluation | LLaVA-Bench In-the-Wild | General Score | 78.9 | 12 |
| Helpfulness Evaluation | LLaVA-Bench | Conversation Score | 93.1 | 11 |
| Visual Question Answering | LLaVA-Bench LLaVAW | Score | 89.1 | 10 |
| Large Multi-modal Model Evaluation | LLaVA-Bench Tool Use (test) | Grounding | 0.893 | 8 |
| Multimodal Tool Use | LLaVA-Bench Tool Use | Grounding | 89.3 | 8 |
| Visual Instruction Following | LLaVA-Bench | Conversation Score | 93.9 | 8 |
| Open-ended Visual Chat | LLaVA-Bench In-the-Wild (full) | Reasoning Score | 90.1 | 8 |
| General Multi-modal Assistant Task | LLaVA-Bench (LLaVA-B) | Score | 77.5 | 7 |
| Open-ended Visual Question Answering | LLaVA Bench v1 (test) | Relevance | 37.18 | 7 |