MLLM Evaluation Suite

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Multimodal Understanding | MLLM Evaluation Suite (HallusionBench, MME, AI2D, RWQA, SQA, POPE, MMBench, CCB, VSR, V7W) (test) | HallusionBench Score: 62.93 | 32 |
| Multimodal Large Language Model Evaluation | MLLM Evaluation Suite | Average Score (All): 56.7 | 22 |
| Multimodal Understanding | MLLM Evaluation Suite GQA MME POPE SQA VQAtext VizWiz MMBen AI2D v1.5 v1.6 Qwen2.5 (test) | GQA Score: 64.8 | 12 |
| Multimodal Question Answering | MLLM Evaluation Suite (GQA, MMB, MMB-CN, MME, POPE, SQA, VQAv2, VQA-Text, VizWiz) (test) | GQA Accuracy: 64.27 | 11 |
| Multimodal Question Answering | MLLM Evaluation Suite (HallBench, MME, TextVQA, ChartQA, AI2D, RealWorldQA, CCBench, OCRVQA, SQA-IMG, POPE) (test) | HallBench: 49.8 | 7 |
| Multimodal Understanding | MLLM Evaluation Suite (GQA, MMBench, MME, POPE, ScienceQA, VQA v2, HRBench-8k, XLRS) | GQA Score: 60.9 | 7 |
| Multimodal Understanding | MLLM Evaluation Suite (GQA, MMBench, MME, POPE, ScienceQA, VQAv2, MMMU, SEED-I) LLaVA-NeXT (test) | GQA Accuracy: 64.2 | 7 |
| Multimodal Understanding and Reasoning | MLLM Evaluation Suite (MME, MMB, VizWiz, POPE, GQA, RQA, VQAT, SQA) standard (test val) | MME Score: 2,375 | 5 |
| Multimodal Understanding | MLLM Evaluation Suite (GQA, MMB, MME, POPE, SQA, VQAv2, VQAText, MMMU, SEED-I, VizWiz) | GQA Score: 65.4 | 4 |
| Multimodal Large Language Model Evaluation | MLLM Evaluation Suite (MME, MMStar, SQA, RealWorldQA, MMMU, MMMU-P, VisuLogic, LogicVista, CRPE, POPE, HallBench) (test) | MME: 74.94 | 4 |