
Towards Fair and Comprehensive Evaluation of Routers in Collaborative LLM Systems

About

Large language models (LLMs) have achieved remarkable success, but cost and privacy constraints often require deploying smaller models locally while offloading complex queries to cloud-based models. Existing router evaluations are unsystematic, overlooking scenario-specific requirements and out-of-distribution robustness. We propose RouterXBench, a principled evaluation framework with three dimensions: router ability, scenario alignment, and cross-domain robustness. Unlike prior work that relies on output probabilities or external embeddings, we use internal hidden states, which capture model uncertainty before answer generation. We introduce ProbeDirichlet, a lightweight router that aggregates cross-layer hidden states via learnable Dirichlet distributions with probabilistic training. Trained on multi-domain data, it generalizes robustly across in-domain and out-of-distribution scenarios. Our results show that ProbeDirichlet achieves 16.68% and 18.86% relative improvements over the best baselines in router ability and high-accuracy scenarios, respectively, with consistent performance across model families, model scales, heterogeneous tasks, and agentic workflows.
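The abstract's core idea — weighting per-layer hidden states with a learnable Dirichlet distribution and probing the result for a routing decision — can be sketched as below. The paper does not spell out the architecture here, so the use of the Dirichlet mean as layer weights, the pooled-vector shapes, and the logistic probe are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def dirichlet_layer_weights(alpha):
    """Expected layer weights under Dirichlet(alpha): E[w_l] = alpha_l / sum(alpha).

    alpha: (num_layers,) learnable concentration parameters, assumed positive.
    """
    alpha = np.asarray(alpha, dtype=float)
    return alpha / alpha.sum()

def route_score(hidden_states, alpha, probe_w, probe_b=0.0):
    """Aggregate per-layer hidden states, then apply a linear probe (hypothetical head).

    hidden_states: (num_layers, hidden_dim) -- one pooled vector per layer
    Returns a routing probability in (0, 1); high values would mean
    "offload this query to the larger cloud model".
    """
    w = dirichlet_layer_weights(alpha)     # (num_layers,) convex combination
    pooled = w @ hidden_states             # (hidden_dim,) cross-layer aggregate
    logit = float(pooled @ probe_w) + probe_b
    return 1.0 / (1.0 + np.exp(-logit))    # sigmoid -> routing probability

# Toy example: 4 layers, 8-dim hidden states (synthetic data)
rng = np.random.default_rng(0)
h = rng.normal(size=(4, 8))
alpha = np.array([1.0, 2.0, 3.0, 4.0])     # later layers weighted more heavily
p = route_score(h, alpha, rng.normal(size=8))
```

In a trained router, `alpha` and the probe parameters would be learned from multi-domain correctness labels; the Dirichlet parameterization keeps the layer weights a valid probability simplex throughout training.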

Wanxing Wu, He Zhu, Yixia Li, Lei Yang, Jiehui Zhao, Hongru Wang, Jian Yang, Benyou Wang, Bingyi Jing, Guanhua Chen • 2026

Related benchmarks

Task          Dataset                          Metric        Result   Rank
LLM Routing   Alpaca (In-Domain)               AUROC         0.7202   7
LLM Routing   Big Math (In-Domain)             AUROC         0.6618   7
LLM Routing   MMLU (In-Domain)                 AUROC         0.6788   7
LLM Routing   Magpie (Out-of-Domain)           AUROC         74.08    7
LLM Routing   MATH (Out-of-Domain)             AUROC         73.9     7
LLM Routing   MMLU Pro (Out-of-Domain)         STEM Score    65.32    7
LLM Routing   MMLU (In-Domain)                 LPM           78.51    7
LLM Routing   Magpie (Out-of-Domain)           LPM Score     63.53    7
LLM Routing   MATH (Out-of-Domain)             LPM Score     69.24    7
LLM Routing   MMLU Pro STEM (Out-of-Domain)    LPM           59.2     7

(Showing 10 of 13 rows.)
