Router-R1: Teaching LLMs Multi-Round Routing and Aggregation via Reinforcement Learning

About

The rapid emergence of diverse large language models (LLMs) has spurred the development of LLM routers that assign user queries to the most suitable model. However, existing LLM routers typically perform a single-round, one-to-one mapping (i.e., assigning each query to a single model in isolation), which limits their capability to tackle complex tasks that demand the complementary strengths of multiple LLMs. In this paper, we present Router-R1, a reinforcement learning (RL)-based framework that formulates multi-LLM routing and aggregation as a sequential decision process. Router-R1 instantiates the router itself as a capable LLM, leveraging its reasoning ability to interleave "think" actions (internal deliberation) with "route" actions (dynamic model invocation), and integrates each response into its evolving context. To facilitate learning, we employ a lightweight rule-based reward comprising format rewards, final outcome rewards, and a novel cost reward for optimizing the balance between performance and cost, opening a pathway toward enhancing performance-cost trade-offs via RL. Router-R1 also conditions only on simple model descriptors such as pricing, latency, and example performance, enabling strong generalization to unseen model selection. Experiments on seven general and multi-hop QA benchmarks show that Router-R1 outperforms several strong baselines, achieving superior performance while maintaining robust generalization and cost management.
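The think/route loop and the three-part reward described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the names `ModelSpec`, `Episode`, `run_episode`, `reward`, and `scripted_policy` are hypothetical, the flat per-call cost, the exact-match outcome check, the -1 format penalty, and the weight `alpha` are all assumptions chosen for illustration, and the "models" are stub functions rather than real LLM calls.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ModelSpec:
    """Hypothetical descriptor for one candidate model in the routing pool."""
    name: str
    cost_per_call: float          # assumed flat price per invocation
    answer: Callable[[str], str]  # stand-in for a real LLM call


@dataclass
class Episode:
    """Everything the router accumulates while handling one query."""
    context: List[str] = field(default_factory=list)
    total_cost: float = 0.0
    final_answer: str = ""


def run_episode(query: str, policy, pool: Dict[str, ModelSpec],
                max_rounds: int = 4) -> Episode:
    """Interleave think and route actions until the policy answers."""
    ep = Episode(context=[f"query: {query}"])
    for _ in range(max_rounds):
        action, payload = policy(ep.context)
        if action == "think":
            # Internal deliberation: only the context grows, no cost incurred.
            ep.context.append(f"think: {payload}")
        elif action == "route":
            # Invoke a chosen model and fold its response into the context.
            model_name, sub_query = payload
            model = pool[model_name]
            ep.total_cost += model.cost_per_call
            ep.context.append(f"route[{model_name}]: {model.answer(sub_query)}")
        elif action == "answer":
            ep.final_answer = payload
            break
    return ep


def reward(ep: Episode, gold: str, well_formed: bool, alpha: float = 0.1) -> float:
    """Rule-based reward: format + final outcome + cost (all terms assumed)."""
    format_r = 0.0 if well_formed else -1.0
    outcome_r = 1.0 if ep.final_answer.strip().lower() == gold.strip().lower() else 0.0
    cost_r = -alpha * ep.total_cost  # cheaper trajectories score higher
    return format_r + outcome_r + cost_r


def scripted_policy(context: List[str]):
    """Toy stand-in for the router LLM: think once, route once, then answer."""
    if len(context) == 1:
        return ("think", "I need a factual lookup.")
    if len(context) == 2:
        return ("route", ("small", "What is the capital of France?"))
    return ("answer", context[-1].split(": ", 1)[1])


pool = {"small": ModelSpec("small", 0.5, lambda q: "Paris")}
ep = run_episode("What is the capital of France?", scripted_policy, pool)
score = reward(ep, gold="paris", well_formed=True)
```

In an RL setup the scripted policy would be replaced by the router LLM itself, and `score` would serve as the scalar training signal for the whole trajectory; raising `alpha` in this sketch pushes the trade-off toward cheaper routing decisions.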

Haozhen Zhang, Tao Feng, Jiaxuan You • 2025

Related benchmarks

Task                          Dataset     Metric    Result  Rank
Code Generation               HumanEval   Pass@1    87.5    850
Mathematical Reasoning        MATH        Accuracy  76.56   535
Mathematical Reasoning        AIME 2025   Accuracy  10      227
Question Answering            SQuAD 2.0   F1        72.45   190
Question Answering            HotpotQA    F1        79.84   114
Multi-hop Question Answering  MuSiQue     --        --      106
Mathematical Reasoning        MathQA      Accuracy  82.81   95
Code Generation               APPS        Pass@1    42.95   69
Multi-hop Question Answering  Bamboogle   Accuracy  51.2    52
Question Answering            TriviaQA    F1        81.47   46
Showing 10 of 17 rows
