LLM-as-a-Judge on Preference Bench (test)
[Chart: Std Dev vs. Accuracy over time. Best result: CalibraEval, Std Dev 2.82, Oct 20, 2024.]
Updated 1mo ago
Evaluation Results

| Method      | Configuration              | Date    | Std Dev | Accuracy |
|-------------|----------------------------|---------|---------|----------|
| CalibraEval | Backbone=ChatGPT           | 2024.10 | 2.82    | 85.98    |
| ChatGPT     | Backbone=ChatGPT, Conf...  | 2024.10 | 3.04    | 85.61    |
| Llama-3-8B  | Backbone=Llama-3-8B, C...  | 2024.10 | 3.36    | 83.43    |
| CalibraEval | Backbone=Llama-3-8B        | 2024.10 | 3.42    | 83.98    |
| Pride       | Backbone=ChatGPT           | 2024.10 | 3.51    | 85.68    |
| Pride       | Backbone=Llama-3-8B        | 2024.10 | 4.35    | 83.24    |
| CalibraEval | Backbone=Qwen-14B          | 2024.10 | 5.12    | 83.88    |
| Pride       | Backbone=Qwen-14B          | 2024.10 | 7.36    | 83.55    |
| Qwen-14B    | Backbone=Qwen-14B, Con...  | 2024.10 | 11.99   | 80.68    |