
Safety Evaluation on AdvBench (Adversarial Attack Metrics)

Overall Safety Score: 100

Post-hoc (LlamaGuard)

Sep 15, 2025
Updated 1mo ago

Evaluation Results

Method  Links
2025.09    100   37.5  46.15   57.5  98.65  85.77  60.96  92.31  72.36
2025.09    100  53.27  28.65  94.81    100  69.04     65  97.88  76.08
2025.09    100  32.88  27.69  68.65    100  61.15  59.23  99.62  68.65
2025.09    100  87.12  64.23  97.31    100  81.73  88.65  99.62  89.83
2025.09    100  90.38  85.19  99.62    100  92.12  85.19  99.42  93.99
2025.09    100     75  75.96  84.62    100  86.54  70.77  99.62  86.56
2025.09    100   97.5  94.23  99.42    100  98.85  95.19  99.62   98.1
2025.09    100  16.35   6.92  31.54  43.65   5.58  14.23  17.88  29.52
2025.09    100  29.81  51.35  49.23  98.85  87.31  63.08  92.12  71.47
2025.09    100  54.62  28.65  94.62  84.42  34.04  62.88  94.81  69.25
2025.09    100  21.15  20.38  64.23  98.85  34.81  42.88  97.88  60.02
2025.09    100  85.96  64.23  96.73  98.08  74.62  89.04  98.08  88.34
2025.09    100  90.96  87.69  99.62  96.15  65.58  81.73  99.04   90.1
2025.09    100  96.35  94.04  98.85    100  98.27  94.23  99.42  97.65
2025.09  99.92  74.24  63.83  95.92  99.92  84.41  74.95  99.02  86.53
2025.09   99.9  75.61  62.43  95.91   95.7  65.36  70.65  98.13  82.96
2025.09  99.81  68.08  63.65  81.35  99.04  65.96  59.23  98.65  79.47
2025.09  99.68  88.57  73.38  97.46   99.6  92.58  90.06  97.99  92.42
2025.09  99.58  47.96  48.19  70.83  99.19  65.22  53.92  97.27  72.77
2025.09  99.57  56.06  56.93  73.29  99.83  86.21  65.89  98.97  79.59
2025.09  99.39  89.85   74.1  97.44  99.65  94.93  89.72   99.1  93.02
2025.09  99.23  63.85  40.38  70.58  48.65  31.73  23.65  67.31  55.67
2025.09  99.23  75.77  76.54  82.69  96.73  97.12  68.85  98.65  86.95
2025.09  98.46  76.54  75.38  79.81   97.5  96.73  71.54  97.69  86.71
2025.09  98.16  52.83  63.75  61.23  97.54  92.72  69.34  87.93  77.94
2025.09  95.85  36.08  30.06  33.52  57.47  34.45  28.58  36.79   44.1
2025.09  95.85  46.78  64.99   48.6  97.08  94.86  67.03   90.2  75.67
2025.09  95.38  21.35  20.38  46.92  42.69  46.92  39.62  27.31  42.57
2025.09  94.71  39.57  42.52  52.34  49.55  63.59  51.94  38.07  54.04
2025.09  94.23  63.65  54.42  70.96  44.62  67.12  50.96  72.12  64.76
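
The column labels for the rows above were not captured on this page. Numerically, the last value in each row appears to be the arithmetic mean of the eight scores before it, rounded to two decimals. Reading that last value as an average is an assumption drawn from the numbers alone, not something the page states. A minimal Python sketch of the check, using three rows copied from the table (the helper name `mean_matches` and the tolerance are illustrative choices):

```python
from statistics import mean

# Three rows copied from the results table: eight scores, then the final value.
# Treating the final value as an average is an assumption, not a documented fact.
rows = [
    [100, 37.5, 46.15, 57.5, 98.65, 85.77, 60.96, 92.31, 72.36],
    [100, 97.5, 94.23, 99.42, 100, 98.85, 95.19, 99.62, 98.1],
    [94.23, 63.65, 54.42, 70.96, 44.62, 67.12, 50.96, 72.12, 64.76],
]


def mean_matches(row, tol=0.01):
    """Return True if the last entry is within tol of the mean of the others."""
    *scores, reported = row
    return abs(mean(scores) - reported) <= tol


for row in rows:
    # Print the reported final value and whether it agrees with the mean.
    print(row[-1], mean_matches(row))
```

The tolerance of 0.01 allows for the two-decimal rounding used in the table.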