LLM-as-a-Judge Evaluation Consistency on PreferenceBench
[Chart: Kappa over time. Top result: CalibraEval, 79.73 Kappa (Oct 20, 2024). Metric toggles: Kappa, ICC(2,k), ICC(3,k).]
Evaluation Results

| Method | Details | Date | Kappa | ICC(2,k) | ICC(3,k) |
|---|---|---|---|---|---|
| CalibraEval | Base Model=GPT4o | 2024.10 | 79.73 | 97.29 | 97.6 |
| GPT4o (Default) | Debiasing=None | 2024.10 | 79.42 | 93.5 | 94.11 |
| CalibraEval | Base Model=Llama-3-8B | 2024.10 | 58.54 | 88.17 | 89.43 |
| Llama-3-8B (Default) | Debiasing=None | 2024.10 | 58.25 | 86.23 | 86.61 |
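For readers unfamiliar with the agreement metrics above: Cohen's kappa measures chance-corrected agreement between two raters (reported here scaled by 100). A minimal sketch of the standard formula, using illustrative judge labels that are not benchmark data:

```python
from collections import Counter

def cohens_kappa(a, b):
    """Chance-corrected agreement between two raters' label lists."""
    assert len(a) == len(b) and a
    n = len(a)
    # Observed agreement: fraction of items where the raters agree.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement under independence, from marginal label frequencies.
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[k] * cb[k] for k in ca) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Two hypothetical LLM judges picking a preferred response per item:
judge1 = ["A", "A", "B", "A", "B", "B"]
judge2 = ["A", "B", "B", "A", "B", "A"]
print(round(cohens_kappa(judge1, judge2), 4))  # → 0.3333
```

Kappa of 1 means perfect agreement and 0 means agreement no better than chance; the ICC(2,k) and ICC(3,k) columns are the analogous intraclass correlation coefficients for k averaged ratings.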