GoLongRL: Capability-Oriented Long Context Reinforcement Learning with Multitask Alignment

About

We present GoLongRL, a fully open-source, capability-oriented post-training recipe for long-context reinforcement learning with verifiable rewards (RLVR). Existing long-context RL methods often treat data construction as a matter of designing increasingly complex retrieval paths, leading to homogeneous task coverage and reward formulations that inadequately reflect practical long-context requirements. Our work offers two contributions. (1) Capability-oriented data construction with full open release. We openly release a dataset of 23K RLVR samples, the complete construction pipeline, and all training code. Guided by a taxonomy of long-context capabilities, the dataset spans 9 task types, each paired with its natural evaluation metric. It comprises curated open-source samples from established corpora and synthetic samples whose QA pairs are generated from real source documents such as books, academic papers, and multi-turn dialogues. Under the same vanilla GRPO setup, our dataset alone outperforms the closed-source QwenLong-L1.5 dataset. Moreover, our Qwen3-30B-A3B model trained on this data delivers long-context performance comparable to DeepSeek-R1-0528 and Qwen3-235B-A22B-Thinking-2507, suggesting that broader coverage and greater reward diversity substantially benefit long-context capability improvement. (2) TMN-Reweight for heterogeneous multitask optimization. To address optimization challenges from heterogeneous rewards, we propose TMN-Reweight, which combines task-level mean normalization for cross-task reward scale alignment with difficulty-adaptive weighting for more reliable advantage estimation. TMN-Reweight further improves average performance over vanilla GRPO, with general capabilities preserved or improved across reported evaluations.
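The abstract names the two ingredients of TMN-Reweight (task-level mean normalization and difficulty-adaptive weighting) but does not give formulas. The sketch below is only an illustration of how such a scheme could sit next to vanilla GRPO advantages, not the authors' implementation. The function names, the choice to divide by a task's mean reward, the assumption that rewards lie in [0, 1], and the weight 4p(1 - p) are all hypothetical.

```python
from statistics import mean, pstdev

EPS = 1e-6  # guards against division by zero


def grpo_advantages(rewards):
    """Vanilla GRPO: standardize rewards within one prompt's rollout group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + EPS) for r in rewards]


def tmn_reweight_sketch(rewards, task_mean):
    """Hypothetical TMN-Reweight-style advantages for one rollout group.

    rewards:   rewards of the rollouts for a single prompt, assumed in [0, 1]
    task_mean: mean reward observed for this prompt's task type (assumed to
               be tracked elsewhere, e.g. over the current batch)
    """
    group_mean = mean(rewards)
    # Assumption: dividing by the task-level mean puts tasks with different
    # reward scales (e.g. exact match vs. a soft overlap metric) on one scale.
    # Assumption: difficulty weight peaks when about half the rollouts succeed
    # and vanishes when the prompt is trivially easy or currently unsolvable.
    p = min(max(group_mean, 0.0), 1.0)
    weight = 4.0 * p * (1.0 - p)
    return [weight * (r - group_mean) / (task_mean + EPS) for r in rewards]


if __name__ == "__main__":
    exact_match_group = [1.0, 0.0, 1.0, 0.0]   # binary reward
    overlap_group = [0.30, 0.20, 0.25, 0.25]   # soft metric, smaller scale
    print(grpo_advantages(exact_match_group))
    print(tmn_reweight_sketch(exact_match_group, task_mean=0.5))
    print(tmn_reweight_sketch(overlap_group, task_mean=0.25))
```

In this sketch the group mean is still subtracted per prompt, as in GRPO; only the scale and the per-group weight differ. The paper itself should be consulted for the actual definitions.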

Minxuan Lv, Tiehua Mei, Tanlong Du, Junmin Chen, Zhenpeng Su, Ziyang Chen, Ziqi Wang, Zhennan Wu, Ruotong Pan, Jian Liang, Ruiming Tang, Han Li • 2026

Related benchmarks

Task	Dataset	Metric	Result	(unlabeled)
General Reasoning	MMLU-Pro	Accuracy	81	213
General Reasoning	GPQA Diamond	Accuracy	72.3	31
General Reasoning	AIME 25	Accuracy	86.9	21
Long-context Reasoning	Long-context Reasoning Suite (test)	Average Score	69.8	18
Agentic memory	Memory-Vec	Accuracy	21.9	5
Agentic memory	Memory-Rec_Sum	Accuracy	46.5	5
Dialogue Memory	LongMemEval	Score	75.2	5
General Reasoning	AIME24	Accuracy	91.3	5
Agentic memory	Memory-KV	Accuracy	16.1	5
