Fine-Tuning Without Forgetting via Loss-Adaptive Learning Rates

About

Fine-tuning large language models on new data improves task performance but degrades capabilities learned during pretraining, a phenomenon known as catastrophic forgetting. Existing methods mitigate this by modifying the fine-tuning objective to suppress high-loss tokens or sequences, but these tokens are essential for learning new tasks, especially those with poor pretraining coverage. In such settings, hard tokens should still contribute to learning, so forgetting must be controlled without suppressing them. We identify a simple mechanism for doing so: per-step forgetting is bounded by the product of the learning rate and the square root of the current training loss. This suggests that high-loss batches are especially prone to inducing forgetting. Motivated by this observation, we introduce FINCH, a loss-adaptive learning-rate schedule that reduces the learning rate on high-loss batches and increases it as the model converges, while leaving the fine-tuning objective unchanged. Across knowledge acquisition, science, and low-resource language adaptation benchmarks, FINCH reduces forgetting by 93% on average while matching the task performance of standard fine-tuning. On Qwen3-4B knowledge acquisition, FINCH cuts TruthfulQA degradation by 5x and reverses HaluEval degradation, while better preserving confidence calibration. Overall, our results show that learning-rate schedules are an effective tool to shape model behavior during fine-tuning, beyond just target-task optimization.
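The bound described above can be made concrete with a small sketch. If per-step forgetting is at most proportional to lr * sqrt(loss), then one simple way to keep that product roughly constant is to scale the learning rate by 1 / sqrt(loss). The abstract does not give FINCH's actual update rule, so everything below is an assumption made for illustration: the function names `loss_adaptive_lr` and `forgetting_proxy`, the reference loss `ref_loss`, and the clipping range `min_scale`/`max_scale` are all hypothetical and not taken from the paper.

```python
import math


def loss_adaptive_lr(base_lr, loss, ref_loss=1.0, min_scale=0.1, max_scale=10.0):
    """Hypothetical loss-adaptive rule (not the paper's exact schedule).

    Scales base_lr by sqrt(ref_loss / loss), so that lr * sqrt(loss)
    stays equal to base_lr * sqrt(ref_loss) whenever the scale is not clipped.
    """
    if loss <= 0.0:
        # Degenerate case: treat a zero loss as "fully converged".
        return base_lr * max_scale
    scale = math.sqrt(ref_loss / loss)
    # Clip so that extreme batches cannot stall or blow up the step size.
    scale = max(min_scale, min(max_scale, scale))
    return base_lr * scale


def forgetting_proxy(lr, loss):
    """The quantity the abstract says bounds per-step forgetting (up to a constant)."""
    return lr * math.sqrt(loss)


if __name__ == "__main__":
    base_lr = 1e-5
    for batch_loss in (4.0, 1.0, 0.25):
        lr = loss_adaptive_lr(base_lr, batch_loss)
        print(f"loss={batch_loss:5.2f}  lr={lr:.2e}  proxy={forgetting_proxy(lr, batch_loss):.2e}")
```

Under these assumptions, a high-loss batch gets a smaller step and a low-loss batch a larger one, while the proxy lr * sqrt(loss) stays the same across unclipped batches. The training objective itself is untouched, which matches the abstract's description of a schedule-only change.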

Parjanya Prajakta Prashant, Jiongli Zhu, Aldan Creo, Babak Salimi • 2026

Related benchmarks

Task | Dataset | Result | Rank
Science Question Answering | ScienceQA | -- | 791
Hallucination Detection | HaluEval | -- | 131
Hallucination Evaluation | HaluEval | -- | 51
Preservation of General Capabilities | HellaSwag, WinoGrande, IFEval, MMLU | HellaSwag Delta: 4.2 | 44
Language Adaptation | Galician | Win-Tie: 98.5 | 31
Knowledge Acquisition | TOFU author-profile questions (held-out) | Task Accuracy: 84.3 | 22
General Knowledge Preservation | General Capability Suite HS WG IFEval MMLU | HS Delta: 2.9 | 22
Science Question Answering | Science QA | Task Accuracy: 65 | 10
Knowledge Acquisition | Knowledge Acquisition | Task Accuracy: 83.3 | 10
Confidence calibration | Qwen3-4B Calibration | Brier Delta: 9.1 | 10
Showing 10 of 13 rows
