Refusal behavior defense on WizardLM (test)

90.4 BadNet CACC

Inst_clean

Jan 7, 2026
Updated 1mo ago

Evaluation Results

Method  Links
2026.01  90.4   2.8  90.4   8.7  90.4   3.2  90.4   1.4
2026.01  89     3.2  87.6  73.4  83.5   4.1  88.1   0.5
2026.01  85.3  16.1  86.7  87.2  78.4  83    89.9   1.4
2026.01  81.2   0    88.9   1.8  78     0.9  76.6   0.5
2026.01  70.2  42.2  87.6  88.5  78    85.8  58.3  35.8
2026.01  69.3  60.6  85    85.8  76    93.6  59.4  55
2026.01  57.8   3.2  57.8  33    57.8   5.5  57.8   0.9
2026.01  55     1.8  53.2   1.8  52.8   3.7  48.2   2.3
2026.01  52.3  69.3  51.8  82.1  50    83    51.4  27.5
2026.01  50.5  87.6  52.3  86.9  43.1  85.8  34.9  50.5
2026.01  48.6  87.6  48.6  87.2  45    86.7  16.5  87
2026.01  46.8  45    48.6  88.5  48.2  85.8  13.4  82.6