A Recipe for Long-Context Reasoning in Large Language Models via On-Policy Optimization and Distillation

About

Existing approaches to post-train models for long-context tasks face complementary limitations: (i) supervised fine-tuning (SFT) provides stable supervision but suffers from exposure bias; (ii) reinforcement learning methods such as Group Relative Policy Optimization (GRPO) train on model-generated trajectories but struggle with long-horizon credit assignment and sparse rewards; and (iii) on-policy distillation (OPD) provides dense token-level guidance but does not directly optimize task rewards. We study these complementary strategies for long-context alignment and derive a recipe that combines GRPO with OPD-style teacher guidance: the student learns from its own rollouts using outcome-level rewards, while a stronger teacher provides dense token-level regularization in place of the standard reference policy. This is especially useful when process-level supervision is difficult to obtain. To support this study, we introduce LongBlocks, a synthetic multilingual dataset spanning multi-hop reasoning, contextual grounding, and long-form generation. Through controlled ablations, we isolate the roles of cold-start initialization, teacher anchoring, and data mixing, showing that our recipe yields a more stable and effective path to long-context reasoning than GRPO or OPD while preserving short-context capabilities.
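To make the combined objective described above more concrete, here is a minimal Python sketch, not the authors' implementation. It assumes outcome rewards are standardized within each group of rollouts (as in GRPO) and that the teacher term is a per-token log-ratio on student-sampled tokens, weighted by a hypothetical coefficient `beta`. The function names, the value of `beta`, and the omission of importance ratios and clipping are all illustrative assumptions.

```python
import math


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize outcome rewards within one group of rollouts (GRPO-style)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (math.sqrt(var) + eps) for r in rewards]


def teacher_anchored_loss(student_logps, teacher_logps, rewards, beta=0.1):
    """Hypothetical per-prompt loss value combining both signals.

    student_logps[i][t]: student log-prob of token t in its own rollout i
    teacher_logps[i][t]: teacher log-prob of that same sampled token
    rewards[i]:          scalar outcome reward for rollout i
    beta:                assumed weight on the teacher regularizer
    """
    advantages = group_relative_advantages(rewards)
    total, n_tokens = 0.0, 0
    for s_lps, t_lps, adv in zip(student_logps, teacher_logps, advantages):
        for s, t in zip(s_lps, t_lps):
            # Outcome-level signal: the rollout's advantage is shared by all its tokens
            # (plain policy-gradient surrogate; clipping is omitted in this sketch).
            pg_term = -adv * s
            # Dense token-level signal: single-sample estimate of the
            # student-to-teacher KL on the sampled token, with the teacher
            # standing in for the usual reference policy.
            kl_term = s - t
            total += pg_term + beta * kl_term
            n_tokens += 1
    return total / n_tokens
```

In a real training loop the log-probabilities would be differentiable tensors from the student and teacher models; plain floats are used here only to show how the two terms combine.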

Miguel Moura Ramos, Duarte M. Alves, André F. T. Martins • 2026

Related benchmarks

Task	Dataset	Metric	Result	Rank
Long-context Language Understanding	LongBench	M-Avg	39.6	294
Long-context evaluation	RULER	Average Accuracy Score	74.8	59
