To Mix or To Merge: Toward Multi-Domain Reinforcement Learning for Large Language Models

About

Reinforcement Learning with Verifiable Rewards (RLVR) plays a key role in stimulating the explicit reasoning capability of Large Language Models (LLMs). RLVR can achieve expert-level performance in specific domains such as coding or math. When a general multi-domain expert-level model is required, we need to carefully consider the collaboration of RLVR across different domains. Current state-of-the-art models mainly employ two training paradigms for multi-domain RLVR: mixed multi-task RLVR, and separate RLVR followed by model merging. However, most prior work does not provide a detailed comparison and analysis of these paradigms. To this end, we choose multiple commonly used high-level tasks (e.g., math, coding, science, instruction following, and agent) as our target domains and design extensive qualitative and quantitative experiments using open-source datasets. We find that RLVR across domains exhibits little mutual interference, and that reasoning-intensive domains demonstrate mutually synergistic effects. Furthermore, we analyze the internal mechanisms of these mutual gains from the perspectives of weight space geometry, information constraints, model prediction behavior, and self-verification. The project is named M2RL, which stands for Mixed multi-task training or separate training followed by model Merging for Reinforcement Learning; the homepage is at https://github.com/Mosi-AI/M2RL.
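The second paradigm, separate RLVR followed by model merging, can be illustrated with a toy example. The abstract does not say which merging method is used, so the following is only a minimal sketch under a stated assumption: each domain expert is fine-tuned from a shared base checkpoint, and the experts are combined by averaging their offsets from that base (a simple form of task-vector merging). The function `merge_experts`, the `scale` parameter, and the toy weights are hypothetical names chosen for illustration, not part of M2RL.

```python
from typing import Dict, List

# Toy stand-in for a model checkpoint: parameter name -> flat list of values.
Weights = Dict[str, List[float]]


def merge_experts(base: Weights, experts: List[Weights], scale: float = 1.0) -> Weights:
    """Merge domain experts by averaging their offsets from a shared base.

    Assumption (not from the source): every expert was trained from `base`
    and has the same parameter names and shapes.
    """
    merged: Weights = {}
    for name, base_param in base.items():
        merged_param = []
        for i, b in enumerate(base_param):
            # Mean offset of the experts from the base value at this position.
            mean_delta = sum(e[name][i] - b for e in experts) / len(experts)
            merged_param.append(b + scale * mean_delta)
        merged[name] = merged_param
    return merged


# Hypothetical two-parameter "model" with a math expert and a coding expert.
base = {"w": [0.0, 1.0]}
math_expert = {"w": [0.2, 1.0]}
code_expert = {"w": [0.0, 1.4]}

merged = merge_experts(base, [math_expert, code_expert])
print(merged)
```

In this sketch each expert moves a different coordinate away from the base, so the merged weights keep half of each expert's change. The mixed multi-task paradigm would instead train a single model on batches drawn from all domains at once, with no post-hoc merging step.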

Haoqing Wang, Xiang Long, Ziheng Li, Yilong Xu, Tingguang Li, Yehui Tang • 2026

Related benchmarks

Task	Dataset	Result
Instruction Following	IFBench	IFBench Score: 26.19	68
Code Generation	LiveCodeBench v6	LCB.v6 Score: 30.29	7
General Multi-domain Reasoning	Multi-domain aggregate	Average Score: 39.87	7
Mathematical Reasoning	AIME 2024, AIME 2025, AMC 2023	AIME 2024 Score: 35.41	7
STEM Knowledge and Reasoning	GPQA Diamond, MMLU Reduced	GPQA Diamond Accuracy: 25.25	7
