Harmful Content Detection on BeaverTails Harmful (held-out target labels)
Top result: 0.793 AUROC (Quotient Transfer)
[Chart: AUROC by date; May 12, 2026]
Updated 21d ago
Evaluation Results
Method                         Configuration              Date     AUROC
Quotient Transfer              Target Model=Qwen-2.5-...  2026.05  0.793
Safety Monitor (In-model)      Model=Qwen-2.5-7B-Inst...  2026.05  0.791
Quotient Transfer              Target Model=Qwen-14B,...  2026.05  0.781
Quotient Transfer              Target Model=Mistral,...   2026.05  0.773
Quotient Transfer              Target Model=Qwen-3B,...   2026.05  0.772
Random-initialization control  -                          2026.05  0.632