Multimodal Evaluation Suite

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Aggregated Multimodal Evaluation | Multimodal Evaluation Suite Average | Average Relative Performance: 100 | 21 |
| Multimodal Understanding | Multimodal Evaluation Suite (GQA, MMBench, MMBench-CN, MME, POPE, SEED-Bench, TextVQA, VizWiz, OCRBench) | GQA Score: 61.5 | 21 |
| Statistical Significance Analysis | Multimodal Evaluation Suite (Tiny-ImageNet, CIFAR-100, FMNIST, Caltech-256, AG News, MMLU, VQA, CommonGen) (test) | Significance Rate (TAP Better): 100 | 18 |
| Multimodal Understanding | Multimodal Evaluation Suite MMB, MME, SQA, VQA^T, MMB^C, MMVet, MMstar, AI2D | MMB Score: 64.6 | 17 |
| Multimodal Understanding and Reasoning | Multimodal Evaluation Suite (MMVet, MMBench_EN, SEED-Bench, LLaVABench, POPE, MME-P, MMVP, MMStar) (Random Sampling Splits of CC12M) | MMVet Score: 30.1 | 13 |
| Multimodal Understanding | Multimodal Evaluation Suite Table 4 | ALL AVG Score: 73.6 | 9 |
| Multimodal Understanding | Multimodal Evaluation Suite GQA, SQA-I, VQA-T, MME, VQAv2, MMB | GQA Score: 59.7 | 7 |
| Comprehensive Multimodal Evaluation | Multimodal Evaluation Suite Composite | Overall Score: 68.7 | 5 |