| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Multimodal Understanding | LLaVA Evaluation Suite 1.5 | Average Score: 100 | 95 |
| Text Membership Inference Attack | LLaVA LLM Pre-training | AUC: 0.688 | 88 |
| Visual Question Answering | LLaVA-W | ROUGE-L: 49.1 | 56 |
| Multi-modal Understanding | LLaVA Multi-modal Evaluation Suite (GQA, MMB, MME, POPE, SQA, VQAv2, TextVQA, MMMU, SEED-I) v1.6 (test) | Average Score: 100 | 53 |
| Text Membership Inference Attack | LLaVA VLLM Tuning | AUC: 0.993 | 44 |
| Multimodal Understanding | LLaVA Evaluation Suite GQA, MMB, MMB-CN, MME, POPE, SQA, VQAV2, VQAText, VizWiz | GQA: 64.2 | 41 |
| Vision-Language Understanding and Reasoning | LLaVA Multimodal Evaluation Suite (GQA, MMBench, MME, POPE, ScienceQA, VQAv2, TextVQA, SEED-Bench, MM-Vet, VizWiz) 1.5 (test/val) | GQA: 62 | 41 |
| Jailbreak Defense | LLaVA v1.5 | ASR: 3.18 | 36 |
| Toxicity Defense | LLaVA v1.5 | Toxicity Score: 22.35 | 36 |
| General Vision-Language Understanding | LLaVA-OneVision | Score: 66.82 | 36 |
| Multimodal Evaluation | LLaVA Evaluation Suite 7B v1.5 (test) | GQA: 61.9 | 34 |
| Visual Instruction Following | LLaVA-W | Score: 102 | 28 |
| Multimodal Understanding and Question Answering | LLaVA 7B Evaluation Suite (GQA, MMBench, MMBench-CN, MME, POPE, ScienceQA, VQAv2, TextVQA, SEED-Bench, VizWiz) 1.5 | GQA Accuracy: 61.9 | 22 |
| Multimodal Large Language Model Inference Efficiency | LLaVA 13B 1.5 (test) | TTFT (ms): 60.2 | 21 |
| Hallucination detection | llava | AUC ROC: 96.5 | 19 |
| Multimodal Understanding | LLaVA High-IC tasks (MMB, POPE, MME, SEED, GQA) 1.5-7B | Performance Ratio: 94.7 | 18 |
| Multi-modal Understanding and Reasoning | LLaVA-QA90 (test) | Accuracy: 6.69 | 18 |
| Multi-modal Instruction Following | LLaVA-Wild | Average Score: 69.8 | 17 |
| Multimodal Understanding | Aggregate LLaVA 1.5 Suite | Relative Average Score: 98.7 | 17 |
| Multimodal Visual Question Answering | LLaVA Evaluation Suite (GQA, MME, POPE, SQA-Img, VizWiz, VQAv2, MMB-En) 1.5 | GQA: 61.9 | 16 |
| Large Vision-Language Model evaluation | LLaVA Evaluation Suite (MMBench, MME, MM-Vet, ScienceQA) 1.5 (test val) | MMBench: 68.5 | 16 |
| Overall Multimodal Performance | LLaVA 665K Evaluation Suite | Relative Score: 100.3 | 15 |
| Jailbreak Detection | LLaVA Vicuna-7B v1.6 | Accuracy: 92 | 13 |
| Image Captioning | MC-LLaVA | Caption Recall (Single): 83.6 | 11 |
| Vision-Language | LLaVa 1.5 | GQA Score: 63.01 | 11 |