Layer-Aware Representation Filtering: Purifying Finetuning Data to Preserve LLM Safety Alignment

About

With the rapid advancement and increasing accessibility of LLMs, fine-tuning aligned models has become a critical step in adapting them to real-world applications, which makes the safety of the fine-tuning process more important than ever. However, recent studies have highlighted a critical challenge: even fine-tuning on seemingly benign downstream datasets can compromise the safety of aligned LLMs, making them more susceptible to malicious instructions. In this paper, we show that fine-tuning datasets often contain samples with safety-degrading features that are not easily identifiable on the surface. These samples can significantly degrade the safety alignment of LLMs during fine-tuning. To address this issue, we propose LARF, a Layer-Aware Representation Filtering method. LARF identifies safety-sensitive layers within the LLM and leverages their representations to detect which samples in the post-training dataset contain safety-degrading features. Experimental results demonstrate that LARF can effectively identify benign data with safety-degrading features. After such data is removed, the degradation of safety alignment caused by fine-tuning is mitigated. Please see our code at https://github.com/LLLeoLi/LARF.
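The filtering step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that). It assumes a safety-sensitive layer has already been chosen and that each sample is represented by one vector taken from that layer. The scoring rule (cosine similarity to an unsafe reference mean minus cosine similarity to a safe reference mean) and the names `degradation_scores`, `filter_samples`, and `drop_fraction` are hypothetical choices made for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def degradation_scores(sample_reps, safe_refs, unsafe_refs):
    """Hypothetical score: higher means the representation sits closer to the
    unsafe reference mean than to the safe reference mean."""
    safe_mean = mean_vector(safe_refs)
    unsafe_mean = mean_vector(unsafe_refs)
    return [cosine(r, unsafe_mean) - cosine(r, safe_mean) for r in sample_reps]


def filter_samples(samples, scores, drop_fraction=0.1):
    """Drop the highest-scoring fraction of samples, keeping the original order."""
    n_drop = int(len(samples) * drop_fraction)
    ranked = sorted(range(len(samples)), key=lambda i: scores[i], reverse=True)
    dropped = set(ranked[:n_drop])
    return [s for i, s in enumerate(samples) if i not in dropped]


# Toy 2-D "representations"; real ones would be hidden states from the chosen layer.
safe_refs = [[1.0, 0.0], [0.9, 0.1]]
unsafe_refs = [[0.0, 1.0], [0.1, 0.9]]
sample_reps = [[1.0, 0.1], [0.2, 1.0], [0.8, 0.3]]
samples = ["a", "b", "c"]

scores = degradation_scores(sample_reps, safe_refs, unsafe_refs)
kept = filter_samples(samples, scores, drop_fraction=0.34)  # drops "b", the top-scoring sample
```

In this toy setup the second sample points toward the unsafe references, so it receives the highest score and is the one removed before fine-tuning. How the safety-sensitive layers are identified and how scores are computed in LARF itself is not specified in the abstract.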

Hao Li, Lijun Li, Zhenghao Lu, Xianyi Wei, Rui Li, Jing Shao, Lei Sha • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Safety Evaluation | HarmBench | ASR | 24.5 | 148 |
| Safety Evaluation | DirectHarm 4 | Attack Success Rate | 15 | 87 |
| Safety Evaluation | HEX-PHI | Attack Success Rate (ASR) | 8.62 | 87 |
| Jailbreak attack success rate | HarmBench | Attack Success Rate (Generated) | 9.5 | 52 |
| Attack Success Rate | HEX-PHI | Attack Success Rate | 0.00e+0 | 48 |
| Attack Success Rate | DirectHarm4 | Attack Success Rate | 19.25 | 48 |
| Safety Evaluation | AdvBench Safety Evaluation | ASR (S1) | 1.35 | 42 |
| Safety Evaluation | CategoricalHarmfulQA Alpaca fine-tuning (test) | ASR Delta (S1-S5) | -1.63 | 42 |
| Safety Evaluation | HarmBench | ASR | 7.5 | 39 |
| Safety Evaluation | CategoricalHarmfulQA Dolly fine-tuning (test) | ASR (S1) | 1.27 | 21 |
Showing 10 of 12 rows
