| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Multimodal Understanding | LLaVA Evaluation Suite 1.5 | Average Score: 100 | 95 |
| Text Membership Inference Attack | LLaVA LLM Pre-training | AUC: 0.688 | 88 |
| Visual Question Answering | LLaVA-W | ROUGE-L: 49.1 | 56 |
| Multimodal Understanding | LLaVA Multimodal Evaluation Suite (GQA, MMB, MME, POPE, SQA, VQAv2, TextVQA, MMMU, SEED-I) v1.6 (test) | Average Score: 100 | 53 |
| Text Membership Inference Attack | LLaVA VLLM Tuning | AUC: 0.993 | 44 |
| Multimodal Understanding | LLaVA Evaluation Suite (GQA, MMB, MMB-CN, MME, POPE, SQA, VQAv2, VQAText, VizWiz) | GQA: 64.2 | 41 |
| Jailbreak Defense | LLaVA v1.5 | ASR: 3.18 | 36 |
| Toxicity Defense | LLaVA v1.5 | Toxicity Score: 22.35 | 36 |
| General Vision-Language Understanding | LLaVA-OneVision | Score: 66.82 | 36 |
| Visual Instruction Following | LLaVA-W | Score: 102 | 28 |
| Multimodal Large Language Model Inference Efficiency | LLaVA 1.5 13B (test) | TTFT (ms): 60.2 | 21 |
| Hallucination Detection | LLaVA | AUC-ROC: 96.5 | 19 |
| Multimodal Understanding | LLaVA High-IC Tasks (MMB, POPE, MME, SEED, GQA) 1.5-7B | Performance Ratio: 94.7 | 18 |
| Multimodal Understanding and Reasoning | LLaVA-QA90 (test) | Accuracy: 6.69 | 18 |
| Multimodal Instruction Following | LLaVA-Wild | Average Score: 69.8 | 17 |
| Multimodal Understanding | Aggregate LLaVA 1.5 Suite | Relative Average Score: 98.7 | 17 |
| Vision-Language Understanding and Reasoning | LLaVA Multimodal Evaluation Suite (GQA, MMBench, MME, POPE, ScienceQA, VQAv2, TextVQA, SEED-Bench, MM-Vet, VizWiz) 1.5 (test/val) | GQA: 0.619 | 16 |
| Large Vision-Language Model Evaluation | LLaVA Evaluation Suite (MMBench, MME, MM-Vet, ScienceQA) 1.5 (test/val) | MMBench: 68.5 | 16 |
| Jailbreak Detection | LLaVA Vicuna-7B v1.6 | Accuracy: 92 | 13 |
| Image Captioning | MC-LLaVA | Caption Recall (Single): 83.6 | 11 |
| Vision-Language | LLaVA 1.5 | GQA Score: 63.01 | 11 |
| Jailbreak Attack | LLaVA 1.5 | ASR: 100 | 10 |
| Vision Understanding | LLaVA-W | Score: 63 | 10 |
| Large Vision-Language Model Evaluation | LLaVA (bench) | Score: 77.8 | 10 |
| Adversarial Attack | LLaVA | CLIP Similarity (RN-50): 0.2427 | 9 |