
QwenLong-L1.5: Post-Training Recipe for Long-Context Reasoning and Memory Management

About

We introduce QwenLong-L1.5, a model that achieves superior long-context reasoning capabilities through systematic post-training innovations. Its key technical contributions are:

(1) Long-Context Data Synthesis Pipeline: We develop a systematic synthesis framework that generates challenging reasoning tasks requiring multi-hop grounding over globally distributed evidence. By deconstructing documents into atomic facts and their underlying relationships, and then programmatically composing verifiable reasoning questions, our approach creates high-quality training data at scale, moving substantially beyond simple retrieval tasks to enable genuine long-range reasoning.

(2) Stabilized Reinforcement Learning for Long-Context Training: To overcome the critical instability of long-context RL, we introduce task-balanced sampling with task-specific advantage estimation to mitigate reward bias, and propose Adaptive Entropy-Controlled Policy Optimization (AEPO), which dynamically regulates the exploration-exploitation trade-off.

(3) Memory-Augmented Architecture for Ultra-Long Contexts: Recognizing that even extended context windows cannot accommodate arbitrarily long sequences, we develop a memory management framework with multi-stage fusion RL training that seamlessly integrates single-pass reasoning with iterative memory-based processing for tasks exceeding 4M tokens.

Built on Qwen3-30B-A3B-Thinking, QwenLong-L1.5 achieves performance comparable to GPT-5 and Gemini-2.5-Pro on long-context reasoning benchmarks, surpassing its baseline by 9.90 points on average. On ultra-long tasks (1M~4M tokens), QwenLong-L1.5's memory-agent framework yields a 9.48-point gain over the agent baseline. The acquired long-context reasoning ability also transfers to general domains such as scientific reasoning, memory tool use, and extended dialogue.
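The task-balanced sampling and task-specific advantage estimation mentioned in point (2) can be sketched as follows. This is a minimal illustration, not the authors' implementation: `task_balanced_sample` and `task_specific_advantages` are hypothetical helper names, and the group-normalized advantage (reward standardized against its own task's mean and std) is one common way to realize per-task advantage estimation so that tasks with systematically easier rewards do not dominate the policy gradient.

```python
import random
from collections import defaultdict

def task_balanced_sample(pool, batch_size, rng=random):
    """Draw a batch with (near-)equal prompt counts per task.

    `pool` maps task name -> list of prompts. Hypothetical helper
    illustrating task-balanced sampling, not the paper's exact code.
    """
    tasks = list(pool)
    per_task = batch_size // len(tasks)
    batch = []
    for t in tasks:
        batch += rng.sample(pool[t], min(per_task, len(pool[t])))
    return batch

def task_specific_advantages(rewards, tasks, eps=1e-6):
    """Normalize each rollout's reward against its own task's
    mean/std, so reward-scale differences across tasks cancel out."""
    by_task = defaultdict(list)
    for r, t in zip(rewards, tasks):
        by_task[t].append(r)
    stats = {}
    for t, rs in by_task.items():
        mean = sum(rs) / len(rs)
        std = (sum((r - mean) ** 2 for r in rs) / len(rs)) ** 0.5
        stats[t] = (mean, std)
    return [(r - stats[t][0]) / (stats[t][1] + eps)
            for r, t in zip(rewards, tasks)]
```

For example, with rewards `[1.0, 0.0]` on task `a` and `[1.0, 0.0]` on task `b`, each rollout's advantage is computed only against its own task's statistics, yielding roughly +1 and -1 within each task rather than a single pooled baseline.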

Weizhou Shen, Ziyi Yang, Chenliang Li, Zhiyuan Lu, Miao Peng, Huashan Sun, Yingcheng Shi, Shengyi Liao, Shaopeng Lai, Bo Zhang, Dayiheng Liu, Fei Huang, Jingren Zhou, Ming Yan • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| General Reasoning | MMLU-Pro | Accuracy | 81.33 | 48 |
| Mathematical Reasoning | AIME 25 | Accuracy | 86.46 | 26 |
| Scientific Reasoning | GPQA Diamond | -- | -- | 24 |
| Complex retrieval and positional sorting | MRCR 128K~512K | Score | 34.87 | 6 |
| Complex retrieval and positional sorting | MRCR 512K~1M | Score | 22.53 | 6 |
| Multi-hop grounding | CorpusQA 1M | Score | 20.72 | 6 |
| Multi-hop grounding | CorpusQA 4M | Score | 14.29 | 3 |
| Agentic memory | BFCL v4 | Memory Sum Component | 24.52 | 2 |
| Dialogue memory reasoning | LongMemEval | Accuracy | 76.4 | 2 |
| Mathematical Reasoning | AIME 24 | Accuracy | 90 | 2 |

Other info

GitHub
