LLM-as-a-Judge on Preference Bench (test)
[Chart: Std Dev and Accuracy over time. Highlighted point: CalibraEval, Std Dev 2.82, Oct 20, 2024]
Updated 4d ago
Evaluation Results
Method       Configuration              Date     Std Dev  Accuracy
CalibraEval  Backbone=ChatGPT           2024.10     2.82     85.98
ChatGPT      Backbone=ChatGPT, Conf...  2024.10     3.04     85.61
Llama-3-8B   Backbone=Llama-3-8B, C...  2024.10     3.36     83.43
CalibraEval  Backbone=Llama-3-8B        2024.10     3.42     83.98
Pride        Backbone=ChatGPT           2024.10     3.51     85.68
Pride        Backbone=Llama-3-8B        2024.10     4.35     83.24
CalibraEval  Backbone=Qwen-14B          2024.10     5.12     83.88
Pride        Backbone=Qwen-14B          2024.10     7.36     83.55
Qwen-14B     Backbone=Qwen-14B, Con...  2024.10    11.99     80.68