Anthropic-HH

Benchmarks

Task Name	Dataset Name	SOTA Result
Safety Evaluation	ANTHROPIC HH (test)	Safety Score 88.85	24
Preference Classification	Anthropic HH Harmless (test)	Accuracy 71.7	22
Single-turn dialogue	Anthropic HH	Win Rate 69.18	18
Dialogue Generation	Anthropic-HH (test)	Average Preference Score 69.07	16
Dialogue	Anthropic-HH (distillation set)	Response Word Count 73.53	16
LLM Alignment	Anthropic-HH (test)	GPT-4o Win Rate 57.53	8
Preference Classification	Anthropic HH Helpful (test)	Accuracy 57.6	7
Win rate evaluation	ANTHROPIC HH (test)	Win Rate 88.82	6
Multi-turn Dialogue	Anthropic HH	Win Rate 77.52	5
Reward Modeling	Anthropic HH (test)	Accuracy 68.49	5
Sycophancy Bias Detection	Anthropic-HH	AUC 0.711	5
Length Bias Detection	Anthropic-HH	AUC 80	5
Reward Modeling	Anthropic HH	Training Samples 340,296	3
Reward Modeling	Anthropic HH (unperturbed)	Win Rate 63.28	2
Instruction Tuning	Anthropic HH (test)	Win Rate 56.3	2
Instruction Tuning	Anthropic HH-RLHF (test)	Metric -	0
