
LLM-as-a-Judge

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Multi-fidelity bandit optimization | LLM-as-a-judge residual-mismatch Λ=128000 (test) | Mean Cost-Weighted Pseudo-Regret: 4,023.4 | 4 |
| Preference evaluation | LLM-as-a-Judge comparison set | TCRM Better Rate: 33.7 | 1 |