SOTA Human Consistency Evaluation benchmarks and papers with code

Benchmarks

Dataset Name	SOTA Method	Metric	Score	Entries	Updated
GenAI-Bench	VQAScore	Kendall's Tau-c	38.4	16	2mo ago
RichHF-18K	Gemini-2.5-Pro	Kendall's Tau	33.9	11	2mo ago
MLLM-as-a-Judge	LLaVA-Critic	CO Consistency Score	30.3	11	2mo ago
Q-Reasoning (test)	Proposed Human-Like Reasoning Framework (detailed)	ROUGE-1 Score	51.4	6	5mo ago
