Layer-Aware Representation Filtering: Purifying Finetuning Data to Preserve LLM Safety Alignment

About

With the rapid advancement and increasing accessibility of LLMs, fine-tuning aligned models has become a critical step for adapting them to real-world applications, which makes the safety of this fine-tuning process more important than ever. However, recent studies have highlighted a critical challenge: even when fine-tuning on seemingly benign downstream datasets, the safety of aligned LLMs can be compromised, making them more susceptible to malicious instructions. In this paper, we show that fine-tuning datasets often contain samples with safety-degrading features that are not easily identifiable on the surface. These samples can significantly degrade the safety alignment of LLMs during fine-tuning. To address this issue, we propose LARF, a Layer-Aware Representation Filtering method. This method identifies safety-sensitive layers within the LLM and leverages their representations to detect which data samples in the post-training dataset contain safety-degrading features. Experimental results demonstrate that LARF can effectively identify benign data with safety-degrading features. After removing such data, the safety alignment degradation caused by fine-tuning is mitigated. Code is available at https://github.com/LLLeoLi/LARF.
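To make the filtering idea concrete, here is a minimal sketch, not the authors' implementation. It assumes that hidden representations from one safety-sensitive layer have already been extracted for every sample, and that small reference sets of safe and unsafe examples are available. The scoring rule (cosine similarity to the mean unsafe representation minus cosine similarity to the mean safe representation), the function names, and the `drop_fraction` parameter are all illustrative assumptions; how the safety-sensitive layers are identified is not covered here.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def safety_degrading_score(rep, safe_mean, unsafe_mean):
    """Hypothetical score: higher means closer to the unsafe reference set."""
    return cosine(rep, unsafe_mean) - cosine(rep, safe_mean)


def filter_dataset(candidate_reps, safe_refs, unsafe_refs, drop_fraction=0.25):
    """Drop the highest-scoring fraction of candidates.

    Returns (indices of kept samples, per-sample scores).
    All inputs are layer representations given as lists of floats.
    """
    safe_mean = mean_vector(safe_refs)
    unsafe_mean = mean_vector(unsafe_refs)
    scores = [
        safety_degrading_score(rep, safe_mean, unsafe_mean)
        for rep in candidate_reps
    ]
    n_drop = int(len(scores) * drop_fraction)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    dropped = set(ranked[:n_drop])
    kept = [i for i in range(len(scores)) if i not in dropped]
    return kept, scores


# Toy 2-D "representations" standing in for real hidden states (assumed data).
safe_refs = [[1.0, 0.0], [0.9, 0.1]]
unsafe_refs = [[0.0, 1.0], [0.1, 0.9]]
candidates = [[1.0, 0.1], [0.2, 1.0], [0.8, 0.3], [0.1, 0.7]]

kept, scores = filter_dataset(candidates, safe_refs, unsafe_refs, drop_fraction=0.5)
print(kept)
```

In this toy run the two candidates that point toward the unsafe reference direction are removed and the remaining indices are kept for fine-tuning. With a real model, the vectors would come from the hidden states of the selected layer rather than hand-written lists.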

Hao Li, Lijun Li, Zhenghao Lu, Xianyi Wei, Rui Li, Jing Shao, Lei Sha • 2025

Related benchmarks

Task	Dataset	Metric	Result
Safety Evaluation	HarmBench	ASR	24.5	153
Safety Evaluation	HEX-PHI	Attack Success Rate (ASR)	8.62	107
Safety Evaluation	DirectHarm4	Attack Success Rate	15	87
Attack Success Rate	HEX-PHI	Attack Success Rate	0.00	63
Jailbreak attack success rate	HarmBench	Attack Success Rate (Generated)	9.5	55
Attack Success Rate	DirectHarm4	Attack Success Rate	19.25	54
Safety Evaluation	AdvBench Safety Evaluation	ASR (S1)	1.35	42
Safety Evaluation	CategoricalHarmfulQA Alpaca fine-tuning (test)	ASR Delta (S1-S5)	-1.63	42
Safety Evaluation	HarmBench	ASR	7.5	39
Mathematical Reasoning and Safety Preservation	GSM8K	HH Safety Score	65.52	24

