| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| VLM-as-a-Judge | MLLM-as-a-Judge | Accuracy: 75.78 | 32 |
| Multimodal Evaluation Consistency | MLLM-as-a-Judge | CO Score: 39.6 | 22 |
| Large Multimodal Model Evaluation | MLLM-as-a-Judge v1.0 (test) | Overall Score: 49 | 16 |
| Reward Modeling | MLLM-as-a-Judge (MaaJ) | Accuracy: 72.18 | 13 |
| Human Consistency Evaluation | MLLM-as-a-Judge | CO Consistency Score: 30.3 | 11 |
| Pointwise Scoring | MLLM-as-a-Judge in-domain v1.0 (test) | ImageDC Score: 80.2 | 9 |