SOTA LLM-as-a-judge evaluation benchmarks and papers with code

Benchmarks

Dataset Name	SOTA Method	Metric	Value		Last Updated
MT-Bench	Qwen3-32B REAL (ours)	Pearson's r	0.689	36	4mo ago
FLASK	Qwen3-32B REAL (ours)	Pearson's r	0.589	36	4mo ago
FB Bench (Feedback Bench)	Qwen3-8B TRACT	Pearson's r	0.949	36	4mo ago
JudgeBench (test)	Skywork-Reward-V2-Llama-3.1-8B-40M	Score	83.4	22	4mo ago
Average Across FB Bench, FLASK, Vic. Bench, MT Bench	Qwen3-32B REAL (ours)	Pearson (r)	71	20	4mo ago
Vicuna Benchmark	Qwen3-32B REAL (ours)	Pearson Correlation (r)	65.1	20	4mo ago
Vicuna Bench	TRACT	Pearson Correlation (r)	0.605	16	5mo ago
English (test)	Bucket-SFT	Overall Score	72.7	11	1mo ago
Chinese (test)	Bucket-SFT	Overall Score	82.7	7	1mo ago
100 Romanian synthetic prompts (test)		Fluency	4.71	7	5mo ago
Coding Qwen (n=8884)	AURA	Adjusted Accuracy	77.74	5	1mo ago
Coding GPT-5.4	AURA	Adjusted Accuracy	67.2	5	1mo ago

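Several rows above report a Pearson correlation. As a rough illustration of that metric, the sketch below computes Pearson's r between a judge model's scores and reference scores. The function name `pearson_r` and the sample score lists are hypothetical and not taken from any listed benchmark. It assumes, without confirmation from the table, that the correlation is measured against human reference ratings. Values such as 71 and 65.1 lie outside the range [-1, 1], so they are presumably the same statistic multiplied by 100; that reading is also an assumption.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical scores on a 1-5 scale (illustrative only, not benchmark data).
reference_scores = [1, 2, 3, 4, 5]
judge_scores = [2, 1, 4, 3, 5]

r = pearson_r(judge_scores, reference_scores)
print(f"Pearson's r = {r:.3f}")          # scale used by rows like 0.689
print(f"Pearson (x100) = {100 * r:.1f}")  # presumed scale of rows like 65.1
```

Because rows may use different scales, values should be compared across rows only after confirming how each source reports the metric.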