GradShield: Alignment Preserving Finetuning

About

Large Language Models (LLMs) are at significant risk of safety misalignment after finetuning, as they can be compromised by both explicitly and implicitly harmful data. Even seemingly benign data can inadvertently steer a model towards misaligned behaviors. To address this, we introduce GradShield, a principled filtering method that safeguards LLMs during finetuning by identifying and removing harmful data points before they corrupt the model's alignment. GradShield computes a Finetuning Implicit Harmfulness Score (FIHS) for each data point and applies an adaptive thresholding algorithm to decide which points to remove. We apply GradShield to multiple utility finetuning tasks across varying levels of harmful data and evaluate the safety and utility of the resulting LLMs using various metrics. The results show that GradShield outperforms all baseline methods, consistently maintaining an Attack Success Rate (ASR) below $6\%$ while preserving utility performance.

Zhanhao Hu, Xiao Huang, Patrick Mendoza, Emad A. Alghamdi, Basel Alomair, Raluca Ada Popa, David Wagner• 2026
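
The abstract describes a score-then-threshold filter and reports safety as an Attack Success Rate. The sketch below illustrates only that general shape. It assumes the per-example harmfulness scores have already been computed (this page does not give the FIHS definition), and it uses a median-plus-MAD cutoff as a stand-in for the paper's adaptive thresholding algorithm, which is also not specified here. The names `adaptive_threshold`, `filter_by_score`, `attack_success_rate`, and the parameter `k` are hypothetical.

```python
from statistics import median


def adaptive_threshold(scores, k=3.0):
    """Stand-in cutoff: median + k * MAD (median absolute deviation).

    ASSUMPTION: this is a generic robust outlier rule chosen for
    illustration, not the thresholding algorithm used by GradShield.
    """
    med = median(scores)
    mad = median(abs(s - med) for s in scores)
    return med + k * mad


def filter_by_score(examples, scores, k=3.0):
    """Split examples into (kept, removed) using a data-dependent cutoff.

    `scores` are assumed to be precomputed harmfulness scores, one per
    example, where a higher score means more likely harmful.
    """
    if len(examples) != len(scores):
        raise ValueError("examples and scores must have the same length")
    tau = adaptive_threshold(scores, k)
    kept = [ex for ex, s in zip(examples, scores) if s <= tau]
    removed = [ex for ex, s in zip(examples, scores) if s > tau]
    return kept, removed, tau


def attack_success_rate(outcomes):
    """Fraction of harmful prompts on which the attack succeeded.

    `outcomes` is an iterable of booleans, True when the model complied
    with a harmful request.
    """
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("outcomes must not be empty")
    return sum(outcomes) / len(outcomes)
```

Under these assumptions, one would call `filter_by_score(dataset, scores)`, finetune on the kept examples only, and then compute `attack_success_rate` over a set of harmful evaluation prompts to compare against the reported bound of $6\%$.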

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Safety Alignment Evaluation | RTA | Utility: 53 | 9 |
| Safety Alignment Evaluation | LATharm | Utility: 53 | 9 |
| Finetuning with implicit harmful data | Identity-shift | Utility: 51 | 8 |
| Safety-constrained Fine-tuning | GSM8K | Utility: 87 | 7 |
| Safety-constrained Fine-tuning | AGNews | Utility: 91 | 7 |
| Safety-constrained Fine-tuning | ARC Easy | Utility: 94 | 7 |
| Safety-constrained Fine-tuning | ARC Challenge | Utility: 79 | 7 |