SafeRLHF

Benchmarks

Task Name	Dataset Name	Metric	SOTA Result	Count
Safety Classification	SafeRLHF	F1 Score	0.94	48
Response Harmfulness Detection	SafeRLHF	F1 Score	72.1	41
Harmlessness preference labeling accuracy	SafeRLHF-RMB (test)	Bench Accuracy	70.6	15
Data Attribution	SafeRLHF	LDS Correlation	0.4608	10
Reward Modeling	SafeRLHF Reversed	Accuracy	88.4	9
Reward Modeling	SafeRLHF Standard	Accuracy	89.8	9
Safety Alignment	SafeRLHF	Win Rate	83	8
Safety Moderation	SafeRLHF	F1 Score	69.9	7
Constitutional AI Alignment	SafeRLHF (test)	Likert Score (5-Point)	4.652	6
Multi-label Safety Categorization	saferlhf	Macro Accuracy	48.35	4
Safety Alignment	SafeRLHF 30K (test)	Safety	94.3	3
Safe and Helpful Response Generation	SafeRLHF-30K (test)	Safe Response Rate	94.3	3
Multi-objective RLHF alignment	safeRLHF (test)	Win Rate	52	1
