Multimodal Evaluation Consistency

Benchmarks

Dataset Name	SOTA Method	Metric	Trend
MLLM-as-a-Judge, RichHF-18K, GenAI-Bench	GPT-4o	Average Score: 44.2		22	2mo ago
MLLM-as-a-Judge	GPT-4o	CO Score: 39.6		22	2mo ago
