Token-level Data Selection for Safe LLM Fine-tuning
About
Fine-tuning large language models (LLMs) on custom datasets has become a standard approach for adapting these models to specific domains and applications. However, recent studies have shown that such fine-tuning can lead to significant degradation in the model's safety. Existing defense methods operate at the sample level and often suffer from an unsatisfactory trade-off between safety and utility. To address this limitation, we perform a systematic token-level diagnosis of safety degradation during fine-tuning. Based on this, we propose token-level data selection for safe LLM fine-tuning (TOSS), a novel framework that quantifies the safety risk of each token by measuring the loss difference between a safety-degraded model and a utility-oriented model. This token-level granularity enables accurate identification and removal of unsafe tokens, thereby preserving valuable task-specific information. In addition, we introduce a progressive refinement strategy, TOSS-Pro, which iteratively enhances the safety-degraded model's ability to identify unsafe tokens. Extensive experiments demonstrate that our approach robustly safeguards LLMs during fine-tuning while achieving superior downstream task performance, significantly outperforming existing sample-level defense methods. Our code is available at https://github.com/Polly-LYP/TOSS.
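The token-level selection described in the abstract can be illustrated with a minimal sketch. It assumes per-token losses from the two reference models have already been computed. The sign convention (utility-model loss minus safety-degraded-model loss, so higher means riskier), the `drop_ratio` parameter, and all function names are assumptions made for illustration. They are not taken from the paper or its repository.

```python
from typing import List


def token_risk_scores(loss_degraded: List[float], loss_utility: List[float]) -> List[float]:
    """Per-token risk as a loss difference between the two reference models.

    Assumption: a token that the safety-degraded model predicts easily (low loss)
    but the utility-oriented model does not (high loss) is treated as riskier.
    """
    if len(loss_degraded) != len(loss_utility):
        raise ValueError("both loss sequences must cover the same tokens")
    return [u - d for d, u in zip(loss_degraded, loss_utility)]


def keep_mask(scores: List[float], drop_ratio: float) -> List[bool]:
    """Mark the highest-risk fraction of tokens as dropped (False)."""
    n_drop = int(len(scores) * drop_ratio)
    # Token indices ordered from highest to lowest risk.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    dropped = set(order[:n_drop])
    return [i not in dropped for i in range(len(scores))]


def masked_mean_loss(token_losses: List[float], mask: List[bool]) -> float:
    """Average the fine-tuning loss over kept tokens only."""
    kept = [loss for loss, keep in zip(token_losses, mask) if keep]
    return sum(kept) / len(kept) if kept else 0.0


# Hypothetical per-token losses for a four-token response.
loss_degraded = [2.0, 0.5, 1.8, 0.3]
loss_utility = [2.1, 3.0, 1.9, 2.5]

scores = token_risk_scores(loss_degraded, loss_utility)
mask = keep_mask(scores, drop_ratio=0.5)
train_loss = masked_mean_loss([1.0, 2.0, 3.0, 4.0], mask)
```

In this toy input, the second and fourth tokens have the largest loss gap, so they are masked out and only the remaining tokens contribute to the training loss. A sample-level defense would instead discard or keep the whole response.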
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Commonsense Reasoning | HellaSwag | Accuracy | 59.03 | 1891 |
| Multitask Language Understanding | MMLU | Accuracy | 63.07 | 413 |
| Safety Evaluation | HEx-PHI | HEx-PHI Score | 93.79 | 162 |
| Safety Evaluation | Anthropic HH (test) | Safety Score | 88.85 | 24 |
| Utility Evaluation | SLIMORCA (test) | Score | 68.85 | 24 |
| Safety Evaluation | HEx-PHI | Harmfulness Score | 2.06 | 16 |
| Safety Evaluation | HEx-PHI | Safety Score (HEx-PHI) | 69.87 | 10 |
| Win Rate Evaluation | SLIMORCA (test) | Win Rate | 88.12 | 8 |
| Win Rate Evaluation | Anthropic HH (test) | Win Rate | 88.82 | 6 |
| Mathematical Reasoning | GSM8K | Win Rate | 62.97 | 3 |