LLM-as-a-Judge Evaluation Consistency on PreferenceBench
Chart: Kappa over time (best result: CalibraEval, 79.73, Oct 20, 2024). Metrics tracked: Kappa, ICC(2,k), ICC(3,k).
Evaluation Results

| Method | Details | Date | Kappa | ICC(2,k) | ICC(3,k) |
|---|---|---|---|---|---|
| CalibraEval | Base Model=GPT4o | 2024.10 | 79.73 | 97.29 | 97.60 |
| GPT4o (Default) | Debiasing=None | 2024.10 | 79.42 | 93.50 | 94.11 |
| CalibraEval | Base Model=Llama-3-8B | 2024.10 | 58.54 | 88.17 | 89.43 |
| Llama-3-8B (Default) | Debiasing=None | 2024.10 | 58.25 | 86.23 | 86.61 |