QwenLong-L1.5: Post-Training Recipe for Long-Context Reasoning and Memory Management
About
We introduce QwenLong-L1.5, a model that achieves superior long-context reasoning capabilities through systematic post-training innovations. The key technical breakthroughs of QwenLong-L1.5 are as follows:

(1) Long-Context Data Synthesis Pipeline: We develop a systematic synthesis framework that generates challenging reasoning tasks requiring multi-hop grounding over globally distributed evidence. By deconstructing documents into atomic facts and their underlying relationships, and then programmatically composing verifiable reasoning questions, our approach creates high-quality training data at scale, moving substantially beyond simple retrieval tasks to enable genuine long-range reasoning capabilities.

(2) Stabilized Reinforcement Learning for Long-Context Training: To overcome the critical instability in long-context RL, we introduce task-balanced sampling with task-specific advantage estimation to mitigate reward bias, and propose Adaptive Entropy-Controlled Policy Optimization (AEPO), which dynamically regulates the exploration-exploitation trade-off.

(3) Memory-Augmented Architecture for Ultra-Long Contexts: Recognizing that even extended context windows cannot accommodate arbitrarily long sequences, we develop a memory management framework with multi-stage fusion RL training that seamlessly integrates single-pass reasoning with iterative memory-based processing for tasks exceeding 4M tokens.

Based on Qwen3-30B-A3B-Thinking, QwenLong-L1.5 achieves performance comparable to GPT-5 and Gemini-2.5-Pro on long-context reasoning benchmarks, surpassing its baseline by 9.90 points on average. On ultra-long tasks (1M to 4M tokens), QwenLong-L1.5's memory-agent framework yields a 9.48-point gain over the agent baseline. Additionally, the acquired long-context reasoning ability translates to enhanced performance in general domains such as scientific reasoning, memory tool use, and extended dialogue.
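The iterative memory-based processing mentioned in item (3) can be pictured as a loop that reads the input one chunk at a time and carries a compact memory forward, so that no single step has to see the whole sequence. The following is a minimal sketch of that general pattern only, not the authors' implementation: `chunk_tokens`, `answer_with_memory`, `toy_update`, and `toy_answer` are hypothetical names, and the two toy callables stand in for what would be model calls in a real system.

```python
from typing import Callable, Iterator, List


def chunk_tokens(tokens: List[str], chunk_size: int) -> Iterator[List[str]]:
    """Yield consecutive, non-overlapping chunks of at most chunk_size tokens."""
    for start in range(0, len(tokens), chunk_size):
        yield tokens[start:start + chunk_size]


def answer_with_memory(
    tokens: List[str],
    question: str,
    update_memory: Callable[[str, List[str], str], str],
    answer_from_memory: Callable[[str, str], str],
    chunk_size: int = 4,
) -> str:
    """Read the input chunk by chunk, then answer from the final memory.

    update_memory and answer_from_memory are placeholders (assumptions):
    in a real system each would be a call to the language model.
    """
    memory = ""  # memory starts empty and is rewritten after every chunk
    for chunk in chunk_tokens(tokens, chunk_size):
        memory = update_memory(memory, chunk, question)
    return answer_from_memory(memory, question)


# Toy stand-ins so the sketch runs without a model (purely illustrative).
def toy_update(memory: str, chunk: List[str], question: str) -> str:
    """Keep only numeric tokens, appended to whatever is already in memory."""
    found = [tok for tok in chunk if tok.isdigit()]
    if not found:
        return memory
    return (memory + " " + " ".join(found)).strip()


def toy_answer(memory: str, question: str) -> str:
    """Return the memory verbatim as the answer."""
    return memory


if __name__ == "__main__":
    doc = "the key is 42 and the door is red 7".split()
    print(answer_with_memory(doc, "Which numbers appear?", toy_update, toy_answer))
```

The design point this illustrates is that the cost of each step depends on the chunk size and the memory size rather than on the total input length, which is what makes inputs beyond the context window tractable in principle.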
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| General Reasoning | MMLU-Pro | Accuracy | 81.33 | 201 |
| Scientific Reasoning | GPQA Diamond | Accuracy | 76.78 | 62 |
| Mathematical Reasoning | AIME 25 | Accuracy | 86.46 | 26 |
| General Reasoning | AIME 25 | Accuracy | 87.9 | 21 |
| General Reasoning | GPQA Diamond | Accuracy | 72.6 | 19 |
| Long-context Reasoning | Long-context Reasoning Suite (test) | Average Score | 74.74 | 18 |
| Mathematical Reasoning | AIME24 | Accuracy | 90 | 12 |
| Complex retrieval and positional sorting | MRCR 128K~512K | Score | 34.87 | 6 |
| Complex retrieval and positional sorting | MRCR 512K~1M | Score | 22.53 | 6 |
| Multi-hop grounding | CorpusQA 1M | Score | 20.72 | 6 |