SFT Doesn't Always Hurt General Capabilities: Revisiting Domain-Specific Fine-Tuning in LLMs
About
Supervised Fine-Tuning (SFT) on domain-specific datasets is a common approach for adapting Large Language Models (LLMs) to specialized tasks, but it is often believed to degrade their general capabilities. In this work, we revisit this trade-off and present both empirical and theoretical insights. First, we show that SFT does not always hurt: using a smaller learning rate can substantially mitigate general performance degradation while preserving comparable target-domain performance. We then provide a theoretical analysis that explains these phenomena and further motivates a new method, Token-Adaptive Loss Reweighting (TALR). Building on this, and recognizing that smaller learning rates alone do not fully eliminate general-performance degradation in all cases, we evaluate a range of strategies for reducing general capability loss, including L2 regularization, LoRA, model averaging, FLOW, and our proposed TALR. Experimental results demonstrate that while no method completely eliminates the trade-off, TALR consistently outperforms these baselines in balancing domain-specific gains and general capabilities. Finally, we distill our findings into practical guidelines for adapting LLMs to new domains: (i) use a small learning rate to achieve a favorable trade-off, and (ii) when a stronger balance is desired, adopt TALR as an effective strategy.
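The abstract names L2 regularization and model averaging among the baselines but does not say how they are configured. As a rough illustration only, the sketch below assumes the common formulations: model averaging as a linear interpolation between base and fine-tuned weights, and L2 regularization as a penalty on the distance from the initial weights. The `Weights` type, the function names, and the `alpha` and `lam` parameters are hypothetical and not taken from the paper; TALR itself is not shown because the abstract does not describe its mechanics.

```python
from typing import Dict, List

# Hypothetical flat parameter store: parameter name -> list of values.
Weights = Dict[str, List[float]]


def average_weights(base: Weights, tuned: Weights, alpha: float) -> Weights:
    """Interpolate per parameter: alpha * tuned + (1 - alpha) * base.

    Assumed formulation of "model averaging"; alpha = 0 keeps the base
    model, alpha = 1 keeps the fine-tuned model.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if base.keys() != tuned.keys():
        raise ValueError("base and tuned must share parameter names")
    return {
        name: [alpha * t + (1.0 - alpha) * b
               for b, t in zip(base[name], tuned[name])]
        for name in base
    }


def l2_to_init_penalty(current: Weights, init: Weights, lam: float) -> float:
    """Return lam * squared L2 distance between current and initial weights.

    Assumed formulation of the L2 baseline: the term is added to the
    task loss to discourage drift away from the pretrained model.
    """
    return lam * sum(
        (c - i) ** 2
        for name in init
        for c, i in zip(current[name], init[name])
    )


if __name__ == "__main__":
    base = {"w": [0.0, 2.0]}
    tuned = {"w": [1.0, 4.0]}
    print(average_weights(base, tuned, alpha=0.5))
    print(l2_to_init_penalty(tuned, base, lam=0.5))
```

Both knobs trade target-domain fit against closeness to the base model, which is the same axis the abstract's learning-rate guideline moves along.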
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Science Question Answering | ScienceQA | -- | 791 |
| Hallucination Detection | HaluEval | -- | 131 |
| Hallucination Evaluation | HaluEval | -- | 51 |
| Preservation of General Capabilities | HellaSwag, WinoGrande, IFEval, MMLU | HellaSwag Delta 5.5 | 44 |
| Forgetting-aware Instruction Tuning | Magicoder Stability and Plasticity suites (test) | ARC-C 53.45 | 36 |
| Language Adaptation | Galician | Win-Tie 94 | 31 |
| General Knowledge Preservation | General Capability Suite HS WG IFEval MMLU | HS Delta 7.9 | 22 |
| Knowledge Acquisition | TOFU author-profile questions (held-out) | Task Accuracy 69.5 | 22 |
| Confidence Calibration | Qwen3-4B Calibration | Brier Delta 12.7 | 10 |
| Hallucination Detection | HaluEval | HaluEval Delta 11.2 | 10 |