
To Mix or To Merge: Toward Multi-Domain Reinforcement Learning for Large Language Models

About

Reinforcement Learning with Verifiable Rewards (RLVR) plays a key role in eliciting the explicit reasoning capability of Large Language Models (LLMs). RLVR can reach expert-level performance in specific domains such as coding or math. When a general, multi-domain expert-level model is required, the interaction of RLVR across domains must be considered carefully. Current state-of-the-art models mainly employ one of two training paradigms for multi-domain RLVR: mixed multi-task RLVR, or separate RLVR followed by model merging. However, most prior work does not provide a detailed comparison and analysis of these paradigms. To this end, we choose several commonly used high-level tasks (e.g., math, coding, science, instruction following, and agent) as our target domains and design extensive qualitative and quantitative experiments using open-source datasets. We find that RLVR across domains exhibits little mutual interference, and that reasoning-intensive domains are mutually synergistic. Furthermore, we analyze the internal mechanisms behind these mutual gains from the perspectives of weight-space geometry, information constraints, model prediction behavior, and self-verification. The project is named M2RL, which stands for Mixed multi-task training or separate training followed by model Merging for Reinforcement Learning; its homepage is at https://github.com/Mosi-AI/M2RL.
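As a rough illustration of the difference between the two paradigms named above, the sketch below contrasts mixing training prompts across domains with merging separately trained weights. It is not taken from the M2RL code base. The merge step uses plain task-vector addition as a stand-in, and the function names (`mix_domains`, `merge_task_vectors`), the `scale` parameter, and the toy weight dictionaries are all hypothetical.

```python
from typing import Dict, List

# Toy stand-in for a model checkpoint: parameter name -> flat list of values.
Weights = Dict[str, List[float]]


def mix_domains(domain_prompts: Dict[str, List[str]]) -> List[str]:
    """Paradigm 1 (mixed multi-task): interleave prompts round-robin so a
    single RL run sees every domain. Hypothetical helper for illustration."""
    mixed: List[str] = []
    longest = max((len(p) for p in domain_prompts.values()), default=0)
    for i in range(longest):
        for prompts in domain_prompts.values():
            if i < len(prompts):
                mixed.append(prompts[i])
    return mixed


def merge_task_vectors(
    base: Weights, experts: List[Weights], scale: float = 1.0
) -> Weights:
    """Paradigm 2 (separate training, then merging): add the scaled sum of
    (expert - base) deltas onto the base weights. This is plain task-vector
    addition, assumed here only as an example merge rule."""
    merged: Weights = {}
    for name, base_values in base.items():
        merged_values = list(base_values)
        for expert in experts:
            for i, expert_value in enumerate(expert[name]):
                merged_values[i] += scale * (expert_value - base_values[i])
        merged[name] = merged_values
    return merged


if __name__ == "__main__":
    print(mix_domains({"math": ["m1", "m2"], "code": ["c1"]}))

    base = {"w": [1.0, 2.0]}
    math_expert = {"w": [1.5, 2.0]}  # pretend result of math-only RLVR
    code_expert = {"w": [1.0, 3.0]}  # pretend result of code-only RLVR
    print(merge_task_vectors(base, [math_expert, code_expert]))
```

In the mixed setting a single policy is optimized on the interleaved stream, whereas in the merging setting each expert is trained alone and the combination happens purely in weight space.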

Haoqing Wang, Xiang Long, Ziheng Li, Yilong Xu, Tingguang Li, Yehui Tang • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Instruction Following | IFBench | IFBench Score: 26.19 | 56 |
| Code Generation | LiveCodeBench v6 | LCB.v6 Score: 30.29 | 7 |
| General Multi-domain Reasoning | Multi-domain aggregate | Average Score: 39.87 | 7 |
| Mathematical Reasoning | AIME 2024, AIME 2025, AMC 2023 | AIME 2024 Score: 35.41 | 7 |
| STEM Knowledge and Reasoning | GPQA Diamond MMLU Reduced | GPQA Diamond Accuracy: 25.25 | 7 |
