
LLM-as-a-Judge Evaluation Consistency on PreferenceBench

79.73 Kappa

CalibraEval

[Chart: Kappa by date, Oct 20, 2024]
Updated 1mo ago

Evaluation Results

Method    Links
2024.10   79.73   97.29   97.6
2024.10   79.42   93.5    94.11
2024.10   58.54   88.17   89.43
2024.10   58.25   86.23   86.61