Safety alignment against harmful fine-tuning on BeaverTails
[Chart: Harm Score (HS) over time. Plotted point: Vaccine, 60.9 (May 23, 2026). Selectable metrics: Harm Score (HS), False Acceptance Rate (FA).]
Updated 8d ago
Evaluation Results
Method                 Method details       Links     Harm Score (HS)   False Acceptance Rate (FA)
Vaccine                Backbone=LLaMA3-8B   2026.05   60.9              27.4
Booster                Backbone=LLaMA3-8B   2026.05   58.1              40
RepNoise               Backbone=LLaMA3-8B   2026.05   57.2              37.9
Buffer-and-Reinforce   Backbone=LLaMA3-8B   2026.05   8.4               77.5