QwenLong-L1.5: Post-Training Recipe for Long-Context Reasoning and Memory Management

About

We introduce QwenLong-L1.5, a model that achieves strong long-context reasoning capabilities through systematic post-training innovations. The key technical contributions of QwenLong-L1.5 are as follows:

(1) Long-Context Data Synthesis Pipeline: We develop a systematic synthesis framework that generates challenging reasoning tasks requiring multi-hop grounding over globally distributed evidence. By deconstructing documents into atomic facts and their underlying relationships, and then programmatically composing verifiable reasoning questions, our approach creates high-quality training data at scale, moving well beyond simple retrieval tasks toward genuine long-range reasoning.

(2) Stabilized Reinforcement Learning for Long-Context Training: To overcome the critical instability in long-context RL, we introduce task-balanced sampling with task-specific advantage estimation to mitigate reward bias, and propose Adaptive Entropy-Controlled Policy Optimization (AEPO), which dynamically regulates the exploration-exploitation trade-off.

(3) Memory-Augmented Architecture for Ultra-Long Contexts: Recognizing that even extended context windows cannot accommodate arbitrarily long sequences, we develop a memory management framework with multi-stage fusion RL training that integrates single-pass reasoning with iterative memory-based processing for tasks exceeding 4M tokens.

Based on Qwen3-30B-A3B-Thinking, QwenLong-L1.5 achieves performance comparable to GPT-5 and Gemini-2.5-Pro on long-context reasoning benchmarks, surpassing its baseline by 9.90 points on average. On ultra-long tasks (1M to 4M tokens), QwenLong-L1.5's memory-agent framework yields a 9.48-point gain over the agent baseline. Additionally, the acquired long-context reasoning ability translates to improved performance in general domains such as scientific reasoning, memory tool use, and extended dialogue.
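The abstract describes item (3) only at a high level, so the following is a minimal sketch of what combining single-pass reasoning with iterative memory-based processing could look like, not the authors' implementation. The function names (`split_into_chunks`, `answer_long_document`), the `answer_fn` and `update_fn` callables that stand in for model calls, and the string-valued memory are all assumptions made for illustration.

```python
from typing import Callable, List

# Hypothetical signatures standing in for model calls (assumptions, not the paper's API):
#   answer_fn(context_tokens, memory, question) -> answer string
#   update_fn(memory, chunk_tokens, question)   -> new memory string
AnswerFn = Callable[[List[str], str, str], str]
UpdateFn = Callable[[str, List[str], str], str]


def split_into_chunks(tokens: List[str], chunk_size: int) -> List[List[str]]:
    """Split a token list into consecutive chunks of at most chunk_size tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


def answer_long_document(
    tokens: List[str],
    question: str,
    context_window: int,
    chunk_size: int,
    answer_fn: AnswerFn,
    update_fn: UpdateFn,
) -> str:
    """Answer in a single pass when the input fits, else read chunk by chunk."""
    # Single-pass mode: the whole document fits in the context window.
    if len(tokens) <= context_window:
        return answer_fn(tokens, "", question)

    # Memory mode: read one chunk at a time and carry a running memory forward.
    memory = ""
    for chunk in split_into_chunks(tokens, chunk_size):
        memory = update_fn(memory, chunk, question)

    # The final answer is produced from the accumulated memory alone.
    return answer_fn([], memory, question)
```

In this sketch the choice between the two modes depends only on input length; how the trained model actually decides, and what its memory contains, is not specified in the abstract.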

Weizhou Shen, Ziyi Yang, Chenliang Li, Zhiyuan Lu, Miao Peng, Huashan Sun, Yingcheng Shi, Shengyi Liao, Shaopeng Lai, Bo Zhang, Dayiheng Liu, Fei Huang, Jingren Zhou, Ming Yan • 2025

Related benchmarks

Task	Dataset	Result
General Reasoning	MMLU-Pro	Accuracy 81.33	201
Scientific Reasoning	GPQA Diamond	Accuracy 76.78	62
Mathematical Reasoning	AIME 25	Accuracy 86.46	26
General Reasoning	AIME 25	Accuracy 87.9	21
General Reasoning	GPQA Diamond	Accuracy 72.6	19
Long-context Reasoning	Long-context Reasoning Suite (test)	Average Score 74.74	18
Mathematical Reasoning	AIME 24	Accuracy 90	12
Complex retrieval and positional sorting	MRCR 128K~512K	Score 34.87	6
Complex retrieval and positional sorting	MRCR 512K~1M	Score 22.53	6
Multi-hop grounding	CorpusQA 1M	Score 20.72	6

Showing 10 of 18 rows
