SPARD: Defending Harmful Fine-Tuning Attack via Safety Projection with Relevance-Diversity Data Selection

About

Fine-tuning large language models often undermines their safety alignment, a problem amplified by harmful fine-tuning attacks, in which adversarial data removes safeguards and induces unsafe behaviors. We propose SPARD, a defense framework that integrates Safety-Projected Alternating optimization with Relevance-Diversity aware data selection. SPARD employs SPAG, which alternates between utility updates and explicit safety projections that use a set of safe data to enforce safety constraints. To curate the safe data, we introduce a Relevance-Diversity Determinantal Point Process that selects a compact safe set, balancing task relevance and safety coverage. Experiments on GSM8K and OpenBookQA under four harmful fine-tuning attacks demonstrate that SPARD consistently achieves the lowest average attack success rates, substantially outperforming state-of-the-art defense methods, while maintaining high task accuracy. Code is available at https://github.com/shuhao02/SPARD.
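To make the relevance-diversity idea concrete, here is a minimal sketch of greedy selection under a determinantal point process. It is not the authors' implementation: the kernel form `L[i][j] = relevance[i] * cos(e_i, e_j) * relevance[j]`, the greedy determinant maximization, and the names `greedy_dpp_select`, `embeddings`, and `relevance` are all assumptions chosen for illustration. The referenced repository holds the actual method.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def det(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    n = len(m)
    result = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            return 0.0
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            result = -result
        result *= m[col][col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n):
                m[r][c] -= factor * m[col][c]
    return result


def greedy_dpp_select(embeddings, relevance, k):
    """Greedily pick up to k indices that maximize det(L_S).

    Assumed kernel (hypothetical, for illustration only):
    L[i][j] = relevance[i] * cos(e_i, e_j) * relevance[j].
    High relevance raises an item's score; similarity to already
    selected items shrinks the determinant and so lowers it.
    """
    n = len(embeddings)
    L = [[relevance[i] * cosine(embeddings[i], embeddings[j]) * relevance[j]
          for j in range(n)] for i in range(n)]
    selected = []
    for _ in range(min(k, n)):
        best, best_det = None, 0.0
        for i in range(n):
            if i in selected:
                continue
            idx = selected + [i]
            d = det([[L[a][b] for b in idx] for a in idx])
            if d > best_det + 1e-12:
                best, best_det = i, d
        if best is None:  # no remaining item adds volume
            break
        selected.append(best)
    return selected


# Toy data: items 0 and 1 are near-duplicates, item 2 is different.
embeddings = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]
relevance = [1.0, 0.9, 0.8]
print(greedy_dpp_select(embeddings, relevance, 2))  # prints [0, 2]
```

In the toy run, item 1 is more relevant than item 2 but nearly duplicates item 0, so the determinant favors the less relevant but more diverse item 2. This is the relevance-versus-coverage trade-off the abstract describes, under the assumed kernel.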

Shuhao Chen, Weisen Jiang, Yeqi Gong, Shengda Luo, Chengxiang Zhuo, Zang Li, James T. Kwok, Yu Zhang • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| Mathematical Reasoning | GSM8K (test) | Accuracy | 50.6 | 954 |
| Question Answering | OpenBookQA | Accuracy | 83.25 | 305 |
| Text Classification | SST-2 | Accuracy | 94.5 | 133 |
| Math Reasoning | GSM8K | Accuracy (GSM8K) | 88.93 | 131 |
| Safety Evaluation | Beavertails | ASR | 8.8 | 19 |
| Safety Evaluation | I-BeaverTails | Attack Success Rate (ASR) | 14.68 | 14 |
| Safety Evaluation | Q-LatHarmful | Attack Success Rate (ASR) | 7.94 | 14 |
| Safety Evaluation | LatHarmful | ASR | 8.28 | 14 |
| Safety Defense | Beavertails | ASR | 10 | 7 |
| Safety Defense | I-BeaverTails | ASR | 12.58 | 7 |
Showing 10 of 14 rows
