
Malicious Fine-tuning Defense on BeaverTails (test)

Harmfulness Score

DeepAlign

[Chart: Harmfulness Score by date; y-axis ticks 0.8584, 1.8142, 2.77, 3.7258; x-axis tick Jul 27, 2025]

Evaluation Results

Date       Harmfulness Score    (unlabeled)
2025.07    1                    0
2025.07    1.01                 0
2025.07    1.06                 0.3
2025.07    1.14                 3.9
2025.07    1.26                 0
2025.07    1.26                 3.3
2025.07    1.57                 0
2025.07    1.57                 0
2025.07    1.77                 12.7
2025.07    2.14                 14.2
2025.07    2.14                 0
2025.07    2.14                 0
2025.07    2.39                 15.1
2025.07    2.43                 16.6
2025.07    2.57                 0
2025.07    2.66                 15.1
2025.07    2.67                 21.2
2025.07    2.88                 18.1
2025.07    2.97                 18.1
2025.07    3.02                 34.2
2025.07    3.17                 26.7
2025.07    3.23                 33.3
2025.07    3.23                 26.7
2025.07    3.38                 43.3
2025.07    3.43                 42.7
2025.07    3.53                 40
2025.07    3.58                 50
2025.07    3.63                 46.7
2025.07    3.7                  33.3
2025.07    3.8                  46.7
2025.07    3.86                 46.7
2025.07    3.86                 53.3
2025.07    3.97                 56.7
2025.07    3.97                 42.4
2025.07    3.97                 42.4
2025.07    4                    50
2025.07    4.03                 50
2025.07    4.13                 66.7
2025.07    4.17                 60
2025.07    4.23                 63.3
2025.07    4.3                  56.7
2025.07    4.3                  60
2025.07    4.52                 80.3
2025.07    4.54                 80