Token-level Data Selection for Safe LLM Fine-tuning

About

Fine-tuning large language models (LLMs) on custom datasets has become a standard approach for adapting these models to specific domains and applications. However, recent studies have shown that such fine-tuning can lead to significant degradation in the model's safety. Existing defense methods operate at the sample level and often suffer from an unsatisfactory trade-off between safety and utility. To address this limitation, we perform a systematic token-level diagnosis of safety degradation during fine-tuning. Based on this, we propose token-level data selection for safe LLM fine-tuning (TOSS), a novel framework that quantifies the safety risk of each token by measuring the loss difference between a safety-degraded model and a utility-oriented model. This token-level granularity enables accurate identification and removal of unsafe tokens, thereby preserving valuable task-specific information. In addition, we introduce a progressive refinement strategy, TOSS-Pro, which iteratively enhances the safety-degraded model's ability to identify unsafe tokens. Extensive experiments demonstrate that our approach robustly safeguards LLMs during fine-tuning while achieving superior downstream task performance, significantly outperforming existing sample-level defense methods. Our code is available at https://github.com/Polly-LYP/TOSS.
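The token-level scoring described in the abstract can be illustrated with a minimal sketch. It is not taken from the authors' repository: the function names, the sign convention (a token counts as riskier when the safety-degraded model assigns it a lower loss than the utility-oriented model), and the fixed `drop_ratio` selection rule are all assumptions made for illustration. The per-token losses are passed in as plain lists, so no particular model library is implied.

```python
from typing import List


def token_risk_scores(loss_degraded: List[float], loss_utility: List[float]) -> List[float]:
    """Per-token risk as a loss difference between two reference models.

    ASSUMPTION: risk = utility-model loss minus safety-degraded-model loss,
    so tokens the degraded model predicts much more easily score higher.
    The abstract does not state the sign convention.
    """
    if len(loss_degraded) != len(loss_utility):
        raise ValueError("both loss sequences must cover the same tokens")
    return [u - d for d, u in zip(loss_degraded, loss_utility)]


def keep_mask(scores: List[float], drop_ratio: float) -> List[bool]:
    """Mark tokens to keep after dropping the highest-risk fraction.

    ASSUMPTION: a fixed drop ratio is a hypothetical selection rule.
    """
    if not 0.0 <= drop_ratio <= 1.0:
        raise ValueError("drop_ratio must be in [0, 1]")
    n_drop = int(len(scores) * drop_ratio)
    # Token indices ordered from highest to lowest risk.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    dropped = set(ranked[:n_drop])
    return [i not in dropped for i in range(len(scores))]


def masked_mean_loss(token_losses: List[float], mask: List[bool]) -> float:
    """Average the fine-tuning loss over kept tokens only."""
    kept = [loss for loss, keep in zip(token_losses, mask) if keep]
    return sum(kept) / len(kept) if kept else 0.0


# Toy example with four tokens (all numbers are made up).
loss_degraded = [2.0, 0.5, 1.5, 0.2]   # losses under the safety-degraded model
loss_utility = [2.1, 3.0, 1.4, 2.5]    # losses under the utility-oriented model
scores = token_risk_scores(loss_degraded, loss_utility)
mask = keep_mask(scores, drop_ratio=0.5)
train_loss = masked_mean_loss([1.0, 2.0, 3.0, 4.0], mask)
print(mask, train_loss)
```

In this toy run the second and fourth tokens receive the largest scores and are excluded from the loss, while the remaining tokens of the same sample still contribute. That is the property the abstract contrasts with sample-level filtering, which would discard the whole sample.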

Yanping Li, Zhening Liu, Zijian Li, Zehong Lin, Jun Zhang • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Commonsense Reasoning | HellaSwag | Accuracy | 59.03 | 1891 |
| Multitask Language Understanding | MMLU | Accuracy | 63.07 | 413 |
| Safety Evaluation | HEx-PHI | HEx-PHI Score | 93.79 | 162 |
| Safety Evaluation | Anthropic HH (test) | Safety Score | 88.85 | 24 |
| Utility Evaluation | SLIMORCA (test) | Score | 68.85 | 24 |
| Safety Evaluation | HEx-PHI | Harmfulness Score | 2.06 | 16 |
| Safety Evaluation | HEx-PHI | Safety Score (HEx-PHI) | 69.87 | 10 |
| Win Rate Evaluation | SLIMORCA (test) | Win Rate | 88.12 | 8 |
| Win Rate Evaluation | Anthropic HH (test) | Win Rate | 88.82 | 6 |
| Mathematical Reasoning | GSM8K | Win Rate | 62.97 | 3 |
Showing 10 of 12 rows
