
Response Harmfulness Detection on HarmBench Suite (HarmBench, SafeRLHF, BeaverTails, XSTest, WildGuard)

0.8333 Macro Avg F1

COLAGUARD

[Chart: y-axis ticks 0.318084, 0.451842, 0.5856, 0.719358; x-axis tick May 27, 2026]
Updated 5d ago

Evaluation Results

Method    Links
2026.05   0.8333   -   -   -   0.8155
2026.05   0.8313   -   -   -   0.8122
2026.05   0.8256   -   -   -   0.8022
2026.05   0.8249   -   -   -   0.808
2026.05   0.81     -   -   -   0.7795
2026.05   0.8078   -   -   -   0.7906
2026.05   0.8004   -   -   -   0.7867
2026.05   0.7434   -   -   -   0.7445
2026.05   0.7348   -   -   -   0.766
2026.05   0.717    -   -   -   0.6699
2026.05   0.7115   -   -   -   0.6497
2026.05   0.6954   -   -   -   0.6922
2026.05   0.6934   -   -   -   0.667
2026.05   0.6632   -   -   -   0.6549
2026.05   0.6588   -   -   -   0.6941
2026.05   0.6146   -   -   -   0.6355
2026.05   0.6      -   -   -   0.5827
2026.05   0.5962   -   -   -   0.6279
2026.05   0.576    -   -   -   0.5567
2026.05   0.5464   -   -   -   0.5773
2026.05   0.3379   -   -   -   0.2724
2026.05   -3,801.03281.960.2122-
2026.05   -318.9132.5041-
2026.05   -4,407.8289.40.1838-
2026.05   -34212.92.3601-