
Anthropic-HH

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Preference Classification | Anthropic HH Harmless (test) | Accuracy 71.7 | 22 |
| Dialogue Generation | Anthropic-HH (test) | Average Preference Score 69.07 | 16 |
| Dialogue | Anthropic-HH (distillation set) | Response Word Count 73.53 | 16 |
| Single-turn dialogue | Anthropic HH | Win Rate 69.18 | 12 |
| Preference Classification | Anthropic HH Helpful (test) | Accuracy 57.6 | 7 |
| Reward Modeling | Anthropic HH (test) | Accuracy 68.49 | 5 |
| Sycophancy Bias Detection | Anthropic-HH | AUC 0.711 | 5 |
| Length Bias Detection | Anthropic-HH | AUC 80 | 5 |
| Instruction Tuning | Anthropic HH (test) | Win Rate 56.3 | 2 |
| Instruction Tuning | Anthropic HH-RLHF (test) | Metric - | 0 |