StereoSet

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Bias Measurement | StereoSet | Overall SS: 63.17 | 25 |
| Stereotype Bias Evaluation | StereoSet Gender | LMS Score: 99.59 | 24 |
| Out-of-Domain (OOD) Bias Evaluation | Stereoset | Accuracy: 67.2 | 14 |
| Reasoning-intensive classification | StereoSet (test) | Macro F1 Score: 93 | 12 |
| Stereotypical Bias Evaluation | StereoSet (dev) | Overall LMS Score: 84.172 | 12 |
| Stereotypical Bias Evaluation | StereoSet (SS) | StereoSet Score (SS): 41 | 10 |
| Personalized Response Generation | StereoSet Explicit Preference (test) | Preference Score: 79.1 | 8 |
| Personalized Response Generation | StereoSet Implicit Preference (test) | Pref Score: 0.474 | 8 |
| Stereotype Bias Evaluation | StereoSet (test) | Gender SS: 77.12 | 8 |
| Stereotype Bias Evaluation | StereoSet Overall | LMS: 91.05 | 8 |
| Stereotype Detection | StereoSet n=237 | Accuracy: 93.4 | 5 |
| Language Model Debiasing | StereoSet (test) | LMS Score: 0.8535 | 5 |
| Bias Evaluation | StereoSet intrasentence | Gender SS: 67.34 | 3 |
| Stereotype Bias Evaluation | StereoSet Race | LMS: 77 | 2 |
| Stereotype Bias Evaluation | StereoSet Religion | LMS: 84 | 2 |
| Stereotype Bias Evaluation | StereoSet Profession | LMS: 78.4 | 2 |