LLM-as-a-Judge on Preference Bench (test)
[Chart: Std Dev vs. Accuracy over time. Best result: CalibraEval, Std Dev 2.82, Oct 20, 2024.]
Updated 1mo ago
Evaluation Results

| Method      | Configuration              | Date    | Std Dev | Accuracy |
|-------------|----------------------------|---------|---------|----------|
| CalibraEval | Backbone=ChatGPT           | 2024.10 | 2.82    | 85.98    |
| ChatGPT     | Backbone=ChatGPT, Conf...  | 2024.10 | 3.04    | 85.61    |
| Llama-3-8B  | Backbone=Llama-3-8B, C...  | 2024.10 | 3.36    | 83.43    |
| CalibraEval | Backbone=Llama-3-8B        | 2024.10 | 3.42    | 83.98    |
| Pride       | Backbone=ChatGPT           | 2024.10 | 3.51    | 85.68    |
| Pride       | Backbone=Llama-3-8B        | 2024.10 | 4.35    | 83.24    |
| CalibraEval | Backbone=Qwen-14B          | 2024.10 | 5.12    | 83.88    |
| Pride       | Backbone=Qwen-14B          | 2024.10 | 7.36    | 83.55    |
| Qwen-14B    | Backbone=Qwen-14B, Con...  | 2024.10 | 11.99   | 80.68    |