SPARD: Defending Harmful Fine-Tuning Attack via Safety Projection with Relevance-Diversity Data Selection
About
Fine-tuning large language models often undermines their safety alignment, a problem further amplified by harmful fine-tuning attacks, in which adversarial data removes safeguards and induces unsafe behaviors. We propose SPARD, a defense framework that integrates Safety-Projected Alternating optimization with Relevance-Diversity aware data selection. SPARD employs SPAG, which alternates between utility updates and explicit safety projections onto a set of safe data to enforce safety constraints. To curate the safe data, we introduce a Relevance-Diversity Determinantal Point Process that selects a compact safe set, balancing task relevance and safety coverage. Experiments on GSM8K and OpenBookQA under four harmful fine-tuning attacks demonstrate that SPARD consistently achieves the lowest average attack success rates, substantially outperforming state-of-the-art defense methods while maintaining high task accuracy. Code is available at https://github.com/shuhao02/SPARD.
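To make the relevance-diversity selection idea concrete, the following is a minimal sketch of greedy selection under a generic quality-diversity DPP kernel. It is not the paper's exact formulation: the kernel `L[i][j] = r_i * cos(e_i, e_j) * r_j`, the function name `greedy_dpp_select`, and the toy embeddings and relevance scores are all assumptions made for illustration. The greedy step uses an incremental Cholesky-style update, a common way to approximate the most probable subset of a DPP.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def greedy_dpp_select(embeddings, relevance, k, eps=1e-10):
    """Greedily pick up to k items that are relevant and mutually diverse.

    Assumed kernel (illustrative, not taken from the paper):
        L[i][j] = relevance[i] * cosine(e_i, e_j) * relevance[j]
    """
    n = len(embeddings)
    kernel = [
        [relevance[i] * cosine(embeddings[i], embeddings[j]) * relevance[j]
         for j in range(n)]
        for i in range(n)
    ]
    # gain[i]: factor by which det(L_S) grows if item i is added to S.
    gain = [kernel[i][i] for i in range(n)]
    # chol[i]: row of the incremental Cholesky factor for candidate i.
    chol = [[] for _ in range(n)]
    selected = []
    remaining = set(range(n))

    while remaining and len(selected) < k:
        # Iterate in sorted order so ties are broken deterministically.
        j = max(sorted(remaining), key=lambda i: gain[i])
        if gain[j] <= eps:
            break  # every remaining item is redundant with the selection
        selected.append(j)
        remaining.remove(j)
        root = math.sqrt(gain[j])
        for i in remaining:
            overlap = sum(a * b for a, b in zip(chol[j], chol[i]))
            e = (kernel[j][i] - overlap) / root
            chol[i].append(e)
            gain[i] -= e * e  # similarity to item j lowers i's gain
    return selected


# Toy data: items 0 and 1 are near-duplicates, item 2 is orthogonal.
toy_embeddings = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0]]
toy_relevance = [1.0, 0.9, 0.8]
print(greedy_dpp_select(toy_embeddings, toy_relevance, k=2))  # prints [0, 2]
```

On this toy input, ranking by relevance alone would pick items 0 and 1, which are nearly identical. The determinant-based gain instead skips the near-duplicate and picks item 2, which is the trade-off between relevance and coverage that a DPP-style selector is meant to capture.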
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | GSM8K (test) | Accuracy | 50.6 | 954 |
| Question Answering | OpenBookQA | Accuracy | 83.25 | 305 |
| Text Classification | SST-2 | Accuracy | 94.5 | 133 |
| Math Reasoning | GSM8K | Accuracy (GSM8K) | 88.93 | 131 |
| Safety Evaluation | Beavertails | ASR | 8.8 | 19 |
| Safety Evaluation | I-BeaverTails | Attack Success Rate (ASR) | 14.68 | 14 |
| Safety Evaluation | Q-LatHarmful | Attack Success Rate (ASR) | 7.94 | 14 |
| Safety Evaluation | LatHarmful | ASR | 8.28 | 14 |
| Safety Defense | Beavertails | ASR | 10 | 7 |
| Safety Defense | I-BeaverTails | ASR | 12.58 | 7 |