Surgical Post-Training: Proximal On-Policy Distillation for Reasoning with Knowledge Retention
About
Injecting new reasoning knowledge into Large Language Models (LLMs) via post-training often induces catastrophic forgetting. Recent studies emphasize the importance of on-policy data but suggest that KL divergence fails to mitigate forgetting. In contrast, we show, both analytically and empirically, that the KL-constrained reward formulation plays a critical role in retaining knowledge during post-training. This motivates Surgical Post-Training (SPOT), a proximal on-policy distillation framework designed to optimize reasoning efficiently while preserving prior knowledge. SPOT consists of (1) a data rectification pipeline in which an Oracle surgically corrects erroneous steps via minimal edits, generating proximal on-policy data; and (2) a reward-based binary cross-entropy objective that is essential for enhancing reasoning and mitigating forgetting. Empirically, with only 4k rectified math pairs, SPOT improves Qwen3-8B's accuracy by 6.2% on average across in-domain and out-of-domain tasks, requiring only 16 minutes of training on 8x H800 GPUs. Moreover, SPOT provides a superior initialization for subsequent reinforcement learning, significantly raising the performance ceiling. Code: https://github.com/Visual-AI/SPoT
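The abstract does not spell out the reward-based binary cross-entropy objective, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes the KL-constrained implicit reward takes the familiar form r = beta * (log pi_theta(y|x) - log pi_ref(y|x)), and that each response in a rectified pair gets a binary label: 1 for the Oracle-corrected response, 0 for the original erroneous one. The function names, the `beta` value, and the labeling scheme are all hypothetical.

```python
import math


def implicit_reward(logp_policy, logp_ref, beta=0.1):
    """Assumed KL-constrained implicit reward: beta * log(pi_theta / pi_ref).

    logp_policy and logp_ref are summed token log-probabilities of the same
    response under the trained policy and the frozen reference model.
    """
    return beta * (logp_policy - logp_ref)


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def reward_bce_loss(logp_policy, logp_ref, label, beta=0.1):
    """Binary cross-entropy on the implicit reward (hypothetical form).

    label = 1 for a corrected response, 0 for an erroneous one.
    """
    r = implicit_reward(logp_policy, logp_ref, beta)
    if label == 1:
        return -log_sigmoid(r)   # push the reward of corrected responses up
    return -log_sigmoid(-r)      # push the reward of erroneous responses down


# Usage: one rectified pair, with made-up log-probabilities.
loss_pair = (
    reward_bce_loss(logp_policy=-42.0, logp_ref=-45.0, label=1)
    + reward_bce_loss(logp_policy=-40.0, logp_ref=-39.0, label=0)
)
```

Under this reading, the reward is measured relative to the reference model, so the loss depends only on how far the policy has moved from it. That is one way a KL-constrained reward could tie reasoning gains to knowledge retention.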
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Instruction Following | IFEval (test) | IFEval Score: 84.8 | 88 |
| General Artificial Intelligence Capabilities | Total Evaluation Suite (aggregate) | Average Score: 53.3 | 10 |
| Mathematical Reasoning | In-domain Reasoning Suite (AIME24, AIME25, AMC23, Math500, Minerva, Olympia) (test) | AIME24 Score: 28 | 10 |
| Out-of-Distribution Reasoning | OOD Reasoning Suite (GPQA-D, Connect4) (test) | GPQA-D Score: 46.8 | 10 |