LLM-as-a-Judge Evaluation Consistency on PreferenceBench
Chart: Kappa over time (best result: CalibraEval, 79.73, Oct 20, 2024). Metrics tracked: Kappa, ICC(2,k), ICC(3,k).
Evaluation Results

| Method | Details | Date | Kappa | ICC(2,k) | ICC(3,k) |
|---|---|---|---|---|---|
| CalibraEval | Base Model=GPT4o | 2024.10 | 79.73 | 97.29 | 97.60 |
| GPT4o (Default) | Debiasing=None | 2024.10 | 79.42 | 93.50 | 94.11 |
| CalibraEval | Base Model=Llama-3-8B | 2024.10 | 58.54 | 88.17 | 89.43 |
| Llama-3-8B (Default) | Debiasing=None | 2024.10 | 58.25 | 86.23 | 86.61 |