Self-Distillation Enables Continual Learning
About
Continual learning, enabling models to acquire new skills and knowledge without degrading existing capabilities, remains a fundamental challenge for foundation models. While on-policy reinforcement learning can reduce forgetting, it requires explicit reward functions that are often unavailable. Learning from expert demonstrations, the primary alternative, is dominated by supervised fine-tuning (SFT), which is inherently off-policy. We introduce Self-Distillation Fine-Tuning (SDFT), a simple method that enables on-policy learning directly from demonstrations. SDFT leverages in-context learning by using a demonstration-conditioned model as its own teacher, generating on-policy training signals that preserve prior capabilities while acquiring new skills. Across skill learning and knowledge acquisition tasks, SDFT consistently outperforms SFT, achieving higher new-task accuracy while substantially reducing catastrophic forgetting. In sequential learning experiments, SDFT enables a single model to accumulate multiple skills over time without performance regression, establishing on-policy distillation as a practical path to continual learning from demonstrations.
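The core idea in the abstract, one model acting as both student and demonstration-conditioned teacher, can be illustrated with a toy sketch. This is not the authors' implementation. The names `toy_logits` and `sdft_style_loss` are hypothetical, the "model" is a deterministic stand-in rather than a neural network, and using the KL divergence from student to teacher as the training signal is an assumption, since the abstract does not specify the objective.

```python
import math
import random


def softmax(logits):
    """Convert logits to a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_logits(context, vocab_size=4):
    """Hypothetical stand-in for a language model: context tokens -> next-token logits."""
    s = sum((i + 1) * t for i, t in enumerate(context))
    return [((s + 3 * v) % 7) / 2.0 for v in range(vocab_size)]


def kl(p, q):
    """KL(p || q) for two distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def sdft_style_loss(prompt, demonstration, num_tokens=5, seed=0):
    """Average per-token KL(student || teacher) along a student-sampled rollout.

    Student: the model conditioned on the prompt only.
    Teacher: the same model, with the demonstration placed in its context.
    Tokens are sampled from the student, which is what makes the signal on-policy.
    The choice of divergence is an assumption made for this sketch.
    """
    rng = random.Random(seed)
    student_ctx = list(prompt)
    teacher_ctx = list(demonstration) + list(prompt)
    total = 0.0
    for _ in range(num_tokens):
        p_student = softmax(toy_logits(student_ctx))
        p_teacher = softmax(toy_logits(teacher_ctx))
        total += kl(p_student, p_teacher)
        # Sample the next token from the student and feed it to both contexts.
        tok = rng.choices(range(len(p_student)), weights=p_student)[0]
        student_ctx.append(tok)
        teacher_ctx.append(tok)
    return total / num_tokens


loss_with_demo = sdft_style_loss(prompt=[1, 2], demonstration=[3, 1])
loss_without_demo = sdft_style_loss(prompt=[1, 2], demonstration=[])
```

With an empty demonstration the teacher and student see the same context, so the loss is zero. A non-empty demonstration shifts the teacher's distribution and yields a positive signal. In an actual training loop this quantity would presumably be minimized with respect to the student's parameters while the teacher's outputs are treated as fixed targets.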
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Science Question Answering | ScienceQA | Accuracy: 81.6 | 791 |
| Mathematical Reasoning | HMMT 2025 | -- | 194 |
| Mathematical Reasoning | AIME 2024 | Mean Score (k=8): 63.3 | 81 |
| Mathematical Reasoning | Math Benchmarks Aggregate | -- | 44 |
| Tool Use | ToolAlpaca | Tool Use Success Rate: 73.5 | 26 |
| Tool Use | tool-use (test) | Accuracy: 62.8 | 24 |
| Mathematical Reasoning | Math Reasoning AIME24, AIME25, HMMT25 | AIME24 Score: 75.9 | 24 |
| Mathematical Reasoning | AMO-Bench | Pass@8: 26 | 20 |
| Preference-Aligned Reasoning | MMLU-Pro | Level Score: 33.4 | 16 |
| Preference-Aligned Reasoning | MATH | Level: 40.6 | 16 |