
PreferenceBench

Benchmarks

Task Name                                Dataset Name     SOTA Result     Trend
LLM-as-a-Judge                           PreferenceBench  Rstd 0.69       36
LLM-as-a-Judge                           PreferenceBench  Accuracy 90.2   21
LLM-as-a-Judge Evaluation Consistency    PreferenceBench  Kappa 79.73     4