Self-Distillation Enables Continual Learning

About

Continual learning, enabling models to acquire new skills and knowledge without degrading existing capabilities, remains a fundamental challenge for foundation models. While on-policy reinforcement learning can reduce forgetting, it requires explicit reward functions that are often unavailable. Learning from expert demonstrations, the primary alternative, is dominated by supervised fine-tuning (SFT), which is inherently off-policy. We introduce Self-Distillation Fine-Tuning (SDFT), a simple method that enables on-policy learning directly from demonstrations. SDFT leverages in-context learning by using a demonstration-conditioned model as its own teacher, generating on-policy training signals that preserve prior capabilities while acquiring new skills. Across skill learning and knowledge acquisition tasks, SDFT consistently outperforms SFT, achieving higher new-task accuracy while substantially reducing catastrophic forgetting. In sequential learning experiments, SDFT enables a single model to accumulate multiple skills over time without performance regression, establishing on-policy distillation as a practical path to continual learning from demonstrations.
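As a rough illustration of the idea in the abstract, the sketch below scores a student's own samples against a teacher that is the same model with a demonstration in context. Everything here is a hypothetical stand-in: `toy_model`, the three-token vocabulary, and the use of a per-token reverse KL divergence are assumptions made for illustration, since the abstract does not specify the training objective.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_model(prompt, prefix, demo=None):
    """Hypothetical stand-in for a language model over a 3-token vocabulary.

    Without a demonstration the next-token distribution is uniform.
    With a demonstration in context, mass shifts toward the demonstrated token.
    """
    logits = [0.0, 0.0, 0.0]
    if demo is not None:
        logits[demo[len(prefix) % len(demo)]] += 2.0
    return softmax(logits)


def reverse_kl(p, q):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def self_distillation_loss(prompt, demo, length, rng):
    """Average per-token KL(student || teacher) along a student-sampled sequence.

    Student: the model without the demonstration.
    Teacher: the same model with the demonstration in context (assumed setup).
    Tokens are sampled from the student, which makes the signal on-policy.
    """
    prefix = []
    total = 0.0
    for _ in range(length):
        student = toy_model(prompt, prefix)
        teacher = toy_model(prompt, prefix, demo)
        total += reverse_kl(student, teacher)
        token = rng.choices(range(len(student)), weights=student)[0]
        prefix.append(token)
    return total / length


if __name__ == "__main__":
    rng = random.Random(0)
    print(self_distillation_loss("question", [1, 2], length=4, rng=rng))
```

In a real setup the two distributions would come from forward passes of one network under two different contexts, and the loss would be minimized with respect to the student's parameters; the toy above only shows how the signal is formed.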

Idan Shenfeld, Mehul Damani, Jonas Hübotter, Pulkit Agrawal • 2026

Related benchmarks

Task | Dataset | Result | Rank
Science Question Answering | ScienceQA | Accuracy: 81.6 | 791
Mathematical Reasoning | HMMT 2025 | -- | 194
Mathematical Reasoning | AIME 2024 | Mean Score (k=8): 63.3 | 81
Mathematical Reasoning | Math Benchmarks Aggregate | -- | 44
Tool Use | ToolAlpaca | Tool Use Success Rate: 73.5 | 26
Tool Use | tool-use (test) | Accuracy: 62.8 | 24
Mathematical Reasoning | Math Reasoning AIME24, AIME25, HMMT25 | AIME24 Score: 75.9 | 24
Mathematical Reasoning | AMO-Bench | Pass@8: 26 | 20
Preference-Aligned Reasoning | MMLU-Pro | Level Score: 33.4 | 16
Preference-Aligned Reasoning | MATH | Level: 40.6 | 16
Showing 10 of 31 rows
