SPARD: Defending Harmful Fine-Tuning Attack via Safety Projection with Relevance-Diversity Data Selection

About

Fine-tuning large language models often undermines their safety alignment, a problem further amplified by harmful fine-tuning attacks in which adversarial data removes safeguards and induces unsafe behaviors. We propose SPARD, a defense framework that integrates Safety-Projected Alternating optimization with Relevance-Diversity aware data selection. SPARD employs SPAG, which alternates between utility updates and explicit safety projections onto a set of safe data to enforce safety constraints. To curate this safe data, we introduce a Relevance-Diversity Determinantal Point Process that selects a compact safe set, balancing task relevance and safety coverage. Experiments on GSM8K and OpenBookQA under four harmful fine-tuning attacks demonstrate that SPARD consistently achieves the lowest average attack success rates, substantially outperforming state-of-the-art defense methods, while maintaining high task accuracy. Code is available at https://github.com/shuhao02/SPARD.
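As a rough illustration of the relevance-diversity selection idea in the abstract, the sketch below builds a determinantal point process (DPP) kernel from embeddings and picks a small subset greedily. Everything here is an assumption for illustration and is not taken from the SPARD code: the function names `build_kernel` and `greedy_dpp`, the use of mean cosine similarity to task samples as the relevance score, the `alpha` weighting, and the standard quality-diversity kernel form L = diag(q) S diag(q) with greedy log-determinant maximization. The paper's actual kernel and selection procedure may differ.

```python
import numpy as np


def build_kernel(safe_emb, task_emb, alpha=1.0):
    """Build a quality-diversity DPP kernel (hypothetical formulation).

    safe_emb: (n, d) embeddings of candidate safe samples.
    task_emb: (m, d) embeddings of fine-tuning task samples.
    alpha:    assumed knob trading relevance against diversity.
    """
    # Normalize rows so that dot products are cosine similarities.
    safe = safe_emb / np.linalg.norm(safe_emb, axis=1, keepdims=True)
    task = task_emb / np.linalg.norm(task_emb, axis=1, keepdims=True)
    # Relevance: mean cosine similarity of each safe sample to the task data.
    relevance = (safe @ task.T).mean(axis=1)
    quality = np.exp(alpha * relevance)
    # Diversity: pairwise cosine similarity among safe samples.
    similarity = safe @ safe.T
    # L[i, j] = q_i * S[i, j] * q_j
    return quality[:, None] * similarity * quality[None, :]


def greedy_dpp(kernel, k):
    """Greedily pick up to k indices that maximize log det of the sub-kernel."""
    n = kernel.shape[0]
    selected = []
    for _ in range(min(k, n)):
        best_i, best_val = None, -np.inf
        for i in range(n):
            if i in selected:
                continue
            idx = selected + [i]
            sign, logdet = np.linalg.slogdet(kernel[np.ix_(idx, idx)])
            # A non-positive determinant means item i is redundant
            # with the items already selected, so it is skipped.
            if sign > 0 and logdet > best_val:
                best_i, best_val = i, logdet
        if best_i is None:
            break
        selected.append(best_i)
    return selected


if __name__ == "__main__":
    # Toy data: samples 0 and 1 are duplicates, sample 2 is orthogonal.
    safe = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
    task = np.array([[1.0, 0.2]])
    print(greedy_dpp(build_kernel(safe, task), k=2))
```

In the toy data, the duplicate of an already chosen sample adds no volume to the determinant, so the greedy step prefers the less relevant but distinct sample. The brute-force `slogdet` loop is chosen for readability; an incremental Cholesky update would be the usual choice for large candidate pools.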

Shuhao Chen, Weisen Jiang, Yeqi Gong, Shengda Luo, Chengxiang Zhuo, Zang Li, James T. Kwok, Yu Zhang • 2026

Related benchmarks

Task	Dataset	Metric	Result	(unlabeled)
Mathematical Reasoning	GSM8K (test)	Accuracy	50.6	954
Question Answering	OpenBookQA	Accuracy	83.25	319
Math Reasoning	GSM8K	Accuracy (GSM8K)	88.93	190
Text Classification	SST-2	Accuracy	94.5	133
Safety Evaluation	Beavertails	ASR	8.8	44
Safety Evaluation	I-BeaverTails	Attack Success Rate (ASR)	14.68	14
Safety Evaluation	Q-LatHarmful	Attack Success Rate (ASR)	7.94	14
Safety Evaluation	LatHarmful	ASR	8.8	14
Safety Defense	Beavertails	ASR	10	7
Safety Defense	I-BeaverTails	ASR	12.58	7

Showing 10 of 14 rows
