LLM-as-a-Judge Performance on RewardBench (test)
[Chart: Accuracy vs. Std Dev (Reward) on RewardBench (test). Best Std Dev (Reward): 2.72, achieved by CalibraEval. Updated Oct 20, 2024.]
Evaluation Results

| Method | Configuration | Date | Std Dev (Reward) | Accuracy |
|---|---|---|---|---|
| CalibraEval | Backbone=Qwen-14B | 2024.10 | 2.72 | 64.25 |
| Pride | Backbone=Qwen-14B | 2024.10 | 4.18 | 64.09 |
| CalibraEval | Backbone=ChatGPT | 2024.10 | 5.51 | 67.13 |
| CalibraEval | Backbone=Llama-3-8B | 2024.10 | 6.48 | 68.12 |
| Pride | Backbone=Llama-3-8B | 2024.10 | 7.51 | 66.54 |
| Pride | Backbone=ChatGPT | 2024.10 | 8.54 | 66.36 |
| Qwen-14B | Backbone=Qwen-14B, Con... | 2024.10 | 11.63 | 63.14 |
| Llama-3-8B | Backbone=Llama-3-8B, C... | 2024.10 | 15.01 | 65.79 |
| ChatGPT | Backbone=ChatGPT, Conf... | 2024.10 | 16.79 | 65.27 |
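The leaderboard does not spell out how its two metrics are computed. A plausible reading is that Accuracy is the fraction of pairwise comparisons where the judge picks the human-preferred answer, and Std Dev (Reward) reflects how unevenly the judge's choices spread across presented answer positions (lower means less positional bias, which is what calibration methods like CalibraEval target). The sketch below is a hypothetical illustration of that reading, not the benchmark's actual scoring code; the record schema (`correct`, `choice`) is invented for the example.

```python
from statistics import pstdev

def judge_metrics(records):
    """Aggregate hypothetical LLM-as-a-judge results on pairwise comparisons.

    records: list of dicts with keys:
      'correct' -- bool, did the judge pick the human-preferred answer?
      'choice'  -- which presented position the judge picked ('A' or 'B')

    Returns (accuracy_pct, selection_std_pct). The second value is the
    population std dev of per-position selection rates: 0 for a judge
    whose picks are perfectly balanced between positions, larger when
    the judge systematically favors one position.
    """
    n = len(records)
    accuracy = 100.0 * sum(r["correct"] for r in records) / n
    # Selection rate (in %) for each presented position.
    rates = [100.0 * sum(r["choice"] == pos for r in records) / n
             for pos in ("A", "B")]
    return accuracy, pstdev(rates)

# Example: 10 comparisons, 7 judged correctly, 6 picks of position A.
recs = [{"correct": c, "choice": p}
        for c, p in zip([True] * 7 + [False] * 3, ["A"] * 6 + ["B"] * 4)]
acc, sd = judge_metrics(recs)  # acc = 70.0, sd = 10.0
```

Under this toy definition, a perfectly position-balanced judge would score `sd = 0.0` regardless of its accuracy, which mirrors how the table's calibrated methods reduce Std Dev (Reward) without large accuracy changes.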