RACER: Risk-Aware Calibrated Efficient Routing for Large Language Models
About
Routing each query to the best-suited large language model (LLM) is crucial for balancing cost and performance in multi-model systems. However, most existing routers select a single model per query, making them susceptible to misrouting. In this work, we formulate LLM routing as the $\alpha$-VOR problem, which minimizes the expected set size while controlling the misrouting risk, and propose a novel method, RACER, that extends base routers to output model sets whose responses can subsequently be aggregated for improved output. Specifically, RACER constructs nested model sets via augmented scoring and uses finite-sample concentration bounds to calibrate a threshold that allows both variable set sizes and abstention. We theoretically prove that RACER achieves rigorous, distribution-free risk control on unseen test data in a post-hoc, model-agnostic manner. Extensive experiments verify our theoretical guarantees and show that RACER consistently improves downstream accuracy across a wide range of benchmarks.
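To make the calibration idea concrete, here is a minimal illustrative sketch of threshold calibration with a finite-sample (Hoeffding-style) concentration bound, in the spirit described above. The function name `calibrate_threshold`, its signature, and the specific bound are assumptions for illustration only; they are not RACER's actual algorithm or API.

```python
import numpy as np

def calibrate_threshold(scores, correct, alpha, delta=0.1):
    """Illustrative sketch (NOT RACER's actual method).

    Scan candidate thresholds and return the largest lambda whose
    empirical misrouting risk on calibration data, inflated by a
    Hoeffding finite-sample slack, stays below alpha. Larger lambda
    means smaller (nested) model sets; an empty set is an abstention
    and is not counted as a misroute.

    scores:  (n, m) router scores for n calibration queries, m models
    correct: (n, m) boolean, whether model j answers query i correctly
    """
    n, _ = scores.shape
    # Hoeffding bound: with prob >= 1 - delta, true risk <= empirical + slack
    slack = np.sqrt(np.log(1.0 / delta) / (2.0 * n))
    best = None
    for lam in np.sort(np.unique(scores)):
        # nested prediction set: all models scoring at least lam
        sets = scores >= lam
        nonempty = sets.any(axis=1)
        covered = (sets & correct).any(axis=1)
        # misroute = committed to a nonempty set containing no correct model
        emp_risk = (nonempty & ~covered).mean()
        if emp_risk + slack <= alpha:
            best = lam  # keep scanning: prefer the largest safe threshold
    return best
```

A larger threshold yields smaller sets (lower aggregation cost) at the price of more abstentions, which is the set-size/risk trade-off the $\alpha$-VOR formulation controls.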
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Question Answering | ARC Challenge | Accuracy | 56.8 | 906 |
| Multitask Language Understanding | MMLU | Accuracy | 63.7 | 413 |
| Multitask Language Understanding | C-MMLU | Accuracy (C-MMLU) | 50.7 | 16 |