
MLLM-as-a-Judge

Benchmarks

Task Name                           Dataset Name                            SOTA Result                  Trend
VLM-as-a-Judge                      MLLM-as-a-Judge                         Accuracy: 75.78              32
Multimodal Evaluation Consistency   MLLM-as-a-Judge                         CO Score: 39.6               22
Large Multimodal Model Evaluation   MLLM-as-a-Judge v1.0 (test)             Overall Score: 49            16
Reward Modeling                     MLLM-as-a-Judge (MaaJ)                  Accuracy: 72.18              13
Human Consistency Evaluation        MLLM-as-a-Judge                         CO Consistency Score: 30.3   11
Pointwise Scoring                   MLLM-as-a-Judge in-domain v1.0 (test)   ImageDC Score: 80.2          9