Surgical Post-Training: Proximal On-Policy Distillation for Reasoning with Knowledge Retention

About

Injecting new reasoning knowledge into Large Language Models (LLMs) via post-training often induces catastrophic forgetting. Recent studies emphasize the importance of on-policy data but suggest that KL divergence fails to mitigate forgetting. In contrast, we show, both analytically and empirically, that the KL-constrained reward formulation plays a critical role in retaining knowledge during post-training. This motivates Surgical Post-Training (SPOT), a proximal on-policy distillation framework designed to improve reasoning efficiently while preserving prior knowledge. SPOT consists of (1) a data rectification pipeline in which an Oracle surgically corrects erroneous steps via minimal edits, generating proximal on-policy data; and (2) a reward-based binary cross-entropy objective that is essential for enhancing reasoning and mitigating forgetting. Empirically, with only 4k rectified math pairs, SPOT improves Qwen3-8B's accuracy by 6.2% on average across in-domain and out-of-domain tasks, requiring only 16 minutes of training on 8x H800 GPUs. Moreover, SPOT provides a superior initialization for subsequent reinforcement learning, significantly raising the performance ceiling. Code: https://github.com/Visual-AI/SPoT
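The abstract names a "reward-based binary cross-entropy objective" built on a KL-constrained reward but does not give its formula. The sketch below shows one plausible reading, not the authors' implementation: it assumes the reward is the familiar implicit form beta * (log pi(y|x) - log pi_ref(y|x)) and that binary cross-entropy is applied to the sigmoid of that reward. The function names, the `beta` value, and the labeling convention (1.0 for a rectified response, 0.0 for the erroneous original) are all hypothetical.

```python
import math


def implicit_reward(logp_policy, logp_ref, beta=0.1):
    """Assumed KL-constrained implicit reward: beta * log(pi / pi_ref).

    logp_policy and logp_ref are sequence log-probabilities of the same
    response under the trained policy and the frozen reference model.
    """
    return beta * (logp_policy - logp_ref)


def reward_bce_loss(logp_policy, logp_ref, label, beta=0.1):
    """Binary cross-entropy on sigmoid(reward).

    label is 1.0 for a response treated as correct and 0.0 for one treated
    as incorrect (an assumed convention, not taken from the paper).
    """
    r = implicit_reward(logp_policy, logp_ref, beta)
    # Numerically stable BCE-with-logits:
    #   max(r, 0) - r * label + log(1 + exp(-|r|))
    return max(r, 0.0) - r * label + math.log1p(math.exp(-abs(r)))


# Hypothetical log-probabilities for one rectified pair.
loss_corrected = reward_bce_loss(-3.0, -5.0, label=1.0)
loss_erroneous = reward_bce_loss(-3.0, -5.0, label=0.0)
```

Under this reading, the loss falls when the policy raises the likelihood of a correct response relative to the reference model and rises when it does the same for an incorrect one. Because the reward is measured against the reference model, the objective is tied to the policy's distance from its starting point, which is the property the abstract credits with knowledge retention.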

Wenye Lin, Kai Han • 2026

Related benchmarks

Task | Dataset | Result | Rank
Instruction Following | IFEval (test) | IFEval Score: 84.8 | 88
General Artificial Intelligence Capabilities | Total Evaluation Suite (aggregate) | Average Score: 53.3 | 10
Mathematical Reasoning | In-domain Reasoning Suite (AIME24, AIME25, AMC23, Math500, Minerva, Olympia) (test) | AIME24 Score: 28 | 10
Out-of-Distribution Reasoning | OOD Reasoning Suite GPQA-D, Connect4 (test) | GPQA-D Score: 46.8 | 10
