LLM-as-a-Judge Performance on RewardBench (test)
[Chart: Std Dev (Reward) vs. Accuracy per method; lowest Std Dev (Reward) is CalibraEval at 2.72; data as of Oct 20, 2024]
Evaluation Results
| Method | Setting | Links | Std Dev (Reward) | Accuracy |
|---|---|---|---|---|
| CalibraEval | Backbone=Qwen-14B | 2024.10 | 2.72 | 64.25 |
| Pride | Backbone=Qwen-14B | 2024.10 | 4.18 | 64.09 |
| CalibraEval | Backbone=ChatGPT | 2024.10 | 5.51 | 67.13 |
| CalibraEval | Backbone=Llama-3-8B | 2024.10 | 6.48 | 68.12 |
| Pride | Backbone=Llama-3-8B | 2024.10 | 7.51 | 66.54 |
| Pride | Backbone=ChatGPT | 2024.10 | 8.54 | 66.36 |
| Qwen-14B | Backbone=Qwen-14B, Con... | 2024.10 | 11.63 | 63.14 |
| Llama-3-8B | Backbone=Llama-3-8B, C... | 2024.10 | 15.01 | 65.79 |
| ChatGPT | Backbone=ChatGPT, Conf... | 2024.10 | 16.79 | 65.27 |
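The two columns can be computed from raw judge outputs. A minimal sketch, assuming each record is a pair of scalar rewards the judge assigned to the preferred and dispreferred response (the record layout and the reading of "Std Dev (Reward)" as the spread of all reward scores are assumptions for illustration, not the benchmark's exact definition):

```python
import statistics

def evaluate_judge(records):
    """Compute a RewardBench-style accuracy and reward spread.

    records: list of (reward_chosen, reward_rejected) pairs, i.e. the
    judge's scalar scores for the preferred and dispreferred answer.
    (This layout is an assumption for illustration.)
    """
    # Accuracy: fraction of pairs where the judge ranks the chosen
    # response above the rejected one, expressed as a percentage.
    correct = sum(1 for rc, rr in records if rc > rr)
    accuracy = 100.0 * correct / len(records)

    # Std Dev (Reward): sample standard deviation over all reward
    # scores; a lower value suggests a more stable judge.
    all_rewards = [r for pair in records for r in pair]
    std_dev = statistics.stdev(all_rewards)
    return accuracy, std_dev

acc, sd = evaluate_judge([(0.9, 0.2), (0.7, 0.8), (0.6, 0.1), (0.8, 0.3)])
print(f"Accuracy={acc:.2f}  StdDev={sd:.4f}")  # 3 of 4 pairs correct
```

Accuracy alone can hide instability: two judges with equal accuracy can differ widely in reward spread, which is why the table reports both metrics.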