Anthropic-HH

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Safety Evaluation | ANTHROPIC HH (test) | Safety Score 88.85 | 24 |
| Preference Classification | Anthropic HH Harmless (test) | Accuracy 71.7 | 22 |
| Dialogue Generation | Anthropic-HH (test) | Average Preference Score 69.07 | 16 |
| Dialogue | Anthropic-HH (distillation set) | Response Word Count 73.53 | 16 |
| Single-turn dialogue | Anthropic HH | Win Rate 69.18 | 12 |
| Preference Classification | Anthropic HH Helpful (test) | Accuracy 57.6 | 7 |
| Win rate evaluation | ANTHROPIC HH (test) | Win Rate 88.82 | 6 |
| Reward Modeling | Anthropic HH (test) | Accuracy 68.49 | 5 |
| Sycophancy Bias Detection | Anthropic-HH | AUC 0.711 | 5 |
| Length Bias Detection | Anthropic-HH | AUC 80 | 5 |
| Reward Modeling | Anthropic HH | Training Samples 340,296 | 3 |
| Reward Modeling | Anthropic HH (unperturbed) | Win Rate 63.28 | 2 |
| Instruction Tuning | Anthropic HH (test) | Win Rate 56.3 | 2 |
| Instruction Tuning | Anthropic HH-RLHF (test) | Metric - | 0 |