Same Evidence, Different Answers: Canonical-Context On-Policy Distillation for Multi-Turn Language Models

About

Large language models (LLMs) often solve a task when all instructions are given in a single prompt, but fail when the same information is revealed gradually across turns. When a clean FULL prompt and a RAW-SHARDED conversation contain the same complete user evidence, the model should still arrive at the same answer. We argue that a key reason for this gap is self-anchored drift: responses produced under partial information introduce unsupported assumptions, and those assumptions later distort the final answer. To reduce this effect, we propose Canonical-Context On-Policy Distillation (CCOPD). During training, the same base model is used in two roles: a frozen teacher conditioned on the clean FULL prompt and a trainable student that receives the same evidence incrementally through a multi-turn conversation; CCOPD aligns the student's behavior on its own trajectories with the teacher's canonical full-context behavior. Trained only on math problem conversations, CCOPD yields a 32% average relative improvement in RAW-SHARDED performance over the original base model across math and five zero-shot out-of-domain task families, while largely preserving full-context performance. Further analyses suggest that CCOPD strengthens grounding in user evidence and reduces sensitivity to contamination from earlier assistant turns.
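
The teacher/student alignment described above can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not state which divergence CCOPD uses, so the per-token reverse KL below, KL(student || teacher), is an assumption borrowed from common on-policy distillation setups. The function names (`log_softmax`, `reverse_kl`, `distill_loss`) and the toy logits are hypothetical. The sketch assumes that, for each token of a response sampled by the student, we already have the student's logits (conditioned on the multi-turn conversation) and the frozen teacher's logits (conditioned on the clean FULL prompt and the same response prefix).

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) at one token position.

    Assumption: reverse KL is our illustrative choice; the source
    does not specify the divergence.
    """
    log_s = log_softmax(student_logits)
    log_t = log_softmax(teacher_logits)
    return sum(math.exp(a) * (a - b) for a, b in zip(log_s, log_t))


def distill_loss(student_logits_per_token, teacher_logits_per_token):
    """Mean per-token divergence over one student-sampled response."""
    if len(student_logits_per_token) != len(teacher_logits_per_token):
        raise ValueError("student and teacher must score the same tokens")
    if not student_logits_per_token:
        raise ValueError("response must contain at least one token")
    per_token = [
        reverse_kl(s, t)
        for s, t in zip(student_logits_per_token, teacher_logits_per_token)
    ]
    return sum(per_token) / len(per_token)


# Toy example: two token positions over a three-word vocabulary.
student = [[2.0, 0.5, 0.1], [0.2, 1.5, 0.3]]  # multi-turn context (hypothetical values)
teacher = [[2.0, 0.5, 0.1], [1.5, 0.2, 0.3]]  # FULL-prompt context (hypothetical values)
loss = distill_loss(student, teacher)
```

In this toy case the first position contributes nothing because the two distributions agree, and the second contributes a positive penalty because the student puts its mass on a different token than the teacher. In a real training loop only the student's parameters would receive gradients from this loss, since the teacher is frozen.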

Zizhuo Lin, Quanling Liu, Jinsheng Quan, Chao Zhang, Yifan Zhu, Xing Shi, Jingtao Xu, Zhihui Li, Yawei Luo • 2026

Related benchmarks

Task	Dataset	Metric	Result
Mathematical Reasoning	MATH	Accuracy (FULL Mode)	93.2	13
Natural language generation (Table-to-text, summarization)	Generation OOD	Score (Full Output)	27.9	13
Structured reasoning (Code, function calling, text-to-SQL)	Structured OOD	Full Accuracy	86.2	13
Weighted aggregate evaluation	All task families	Aggregate Score (All F/C)	64.2	13

