
OPSDL: On-Policy Self-Distillation for Long-Context Language Models

About

Extending the effective context length of large language models (LLMs) remains a central challenge for real-world applications. While recent post-training methods have made progress in long-context scaling, they rely on either high-quality supervision data or sparse sequence-level rewards, which leads to unstable and inefficient optimization. We propose OPSDL, an On-Policy Self-Distillation method for enhancing the Long-context capabilities of LLMs. Unlike other recent self-distillation methods that inject privileged information and rely on the model's in-context learning ability to act as a teacher, OPSDL uses the model's own strong short-context capability as a self-teacher to supervise its generation in long-context scenarios. The model first generates responses conditioned on the full long context; the self-teacher then provides per-token supervision signals via point-wise reverse KL divergence, conditioned on the relevant extracted short context. This dense token-level signal encourages faithful use of relevant evidence and mitigates hallucinations induced by irrelevant context. We evaluate OPSDL on long-context benchmarks across a range of models from 7B to 32B parameters. Results show consistent and substantial improvements across varying context lengths, outperforming standard post-training approaches such as SFT and DPO with higher sample efficiency. Notably, these gains are achieved without degrading general short-context performance. These findings highlight the effectiveness of OPSDL as a scalable and stable approach for long-context learning.
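
The per-token signal described in the abstract can be illustrated with a small sketch. This is only one possible reading, not the authors' implementation: it assumes "point-wise reverse KL" means KL(student || teacher) computed independently at each response position, with the student conditioned on the full long context and the self-teacher (the same model) conditioned on the extracted short context. The function names and the toy logits are hypothetical, and the abstract does not specify how the short context is extracted or how the per-token values are weighted.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def per_token_reverse_kl(student_logits, teacher_logits):
    """Reverse KL, KL(student || teacher), computed separately at each position.

    Assumed setup (not confirmed by the source):
      student_logits[t]: logits for token t given the full long context.
      teacher_logits[t]: logits for token t given only the extracted short context.
    Both are scored on the same response sampled from the student.
    """
    kls = []
    for s_row, t_row in zip(student_logits, teacher_logits):
        s_logp = log_softmax(s_row)
        t_logp = log_softmax(t_row)
        # sum_v p_student(v) * (log p_student(v) - log p_teacher(v))
        kls.append(sum(math.exp(ls) * (ls - lt) for ls, lt in zip(s_logp, t_logp)))
    return kls

# Toy example: 2 response positions, vocabulary of 3 tokens (made-up numbers).
student = [[2.0, 0.5, 0.1], [0.3, 0.3, 0.3]]  # position 2: student is unsure
teacher = [[2.0, 0.5, 0.1], [1.5, 0.2, 0.1]]  # position 2: teacher prefers token 0
kl = per_token_reverse_kl(student, teacher)
loss = sum(kl) / len(kl)  # simple mean over positions; the weighting is an assumption
```

In this toy case the first position contributes no loss because the two distributions agree, while the second position is penalized because the long-context distribution has drifted from the short-context one. That is the sense in which the signal is dense: every generated token receives its own value rather than one reward per sequence.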

Xinsen Zhang, Zhenkai Ding, Tianjun Pan, Run Yang, Chun Kang, Xue Xiong, Jingnan Gu • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Long-context Reasoning | LongBench v2 | Average Score: 36.5 | 88 |
| Long-context language modeling | RULER | Accuracy (8K Context): 96.29 | 75 |
| Structured reasoning (Code, function calling, text-to-SQL) | Structured OOD | Full Accuracy: 87.8 | 13 |
| Natural language generation (Table-to-text, summarization) | Generation OOD | Score (Full Output): 27.7 | 13 |
| Weighted aggregate evaluation | All task families | Aggregate Score (All F/C): 64 | 13 |
| Mathematical Reasoning | MATH | Accuracy (FULL Mode): 86 | 13 |
| Comprehensive long-context evaluation | RULER and LongBench V2 | Total Average Score: 64.93 | 12 |
