Safety Detection on PKU-SafeRLHF (held-out)
Figure: AUROC over time (top result: MultiLayer-DIM, 90.6 AUROC; May 18, 2026).
Evaluation Results
Method                   Protocol           Date     AUROC
MultiLayer-DIM           Leave-one-ben...   2026.05  90.6
Geometry-Lite            Leave-one-ben...   2026.05  90.3
TaT-Disp-LSTM            Leave-one-ben...   2026.05  90.2
MultiLayer-Linear        Leave-one-ben...   2026.05  90.0
Best single-layer probe  Leave-one-ben...   2026.05  87.0
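Methods are ranked by AUROC, the area under the ROC curve of a detector that assigns each example a score for being unsafe. The page does not publish its scoring code, so the following is only a minimal Python sketch under stated assumptions: labels are binary (1 = unsafe, 0 = safe, a convention assumed here), higher scores mean "more likely unsafe", and the table reports AUROC multiplied by 100 (inferred from values such as 90.6). The helper name `auroc` is hypothetical. It uses the rank-statistic (Mann-Whitney) definition: the probability that a randomly chosen unsafe example outscores a randomly chosen safe one, with ties counted as half.

```python
from typing import Sequence


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Return AUROC in [0, 1] for binary labels (1 = positive, 0 = negative).

    Hypothetical helper, not the leaderboard's own code. Uses the
    Mann-Whitney formulation: the fraction of (positive, negative) pairs in
    which the positive example has the higher score, with ties counted as 0.5.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    # Quadratic pairwise comparison: simple and exact, fine for a sketch.
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy detector scores (higher = more likely unsafe); labels: 1 = unsafe.
    scores = [0.9, 0.8, 0.4, 0.35, 0.1]
    labels = [1, 1, 0, 1, 0]
    # 5 of the 6 (unsafe, safe) pairs are ranked correctly, so AUROC = 5/6.
    print(round(100 * auroc(scores, labels), 1))  # prints 83.3
```

Under the scaling assumption above, `round(100 * auroc(scores, labels), 1)` puts a result on the same 0 to 100 scale as the table. The pairwise loop is quadratic in dataset size to keep the definition visible; a sort-based rank computation gives the same value in O(n log n) for large held-out sets.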