LLM-as-a-Judge

Benchmarks

Task Name	Dataset Name	Metric	SOTA Result
LLM-as-a-Judge	LLM-as-a-Judge (10-fold cross-validation)	CG Accuracy	88	8
Multi-fidelity bandit optimization	LLM-as-a-judge residual-mismatch Λ=128000 (test)	Mean Cost-Weighted Pseudo-Regret	4,023.4	4
Preference evaluation	LLM-as-a-Judge comparison set	TCRM Better Rate	33.7	1
