A Scalable Multi-LLM Collaboration System with Retrieval-based Selection and Exploration-Exploitation-Driven Enhancement
About
Existing multi-LLM collaboration systems often encounter scalability challenges when integrating new LLMs and tasks, leading to suboptimal performance. To address this, we propose SMCS, a Scalable Multi-LLM Collaboration System designed to effectively coordinate multiple open-source LLMs. The system consists of two core components: a Retrieval-based Prior Selection (RPS) module, which dynamically selects the most suitable LLMs for each input, and an Exploration-Exploitation-Driven Posterior Enhancement (EPE) module, which fosters response diversity and selects high-quality outputs through a hybrid scoring mechanism. Experiments on eight mainstream benchmarks validate the effectiveness of our system: by integrating fifteen open-source LLMs, SMCS outperforms prevailing closed-source LLMs, e.g., GPT-4.1 (+5.36%) and GPT-o3-mini (+5.28%), across multiple tasks. Remarkably, it even exceeds the average of the best results obtained by open-source LLMs on each dataset (+2.86%), significantly advancing the empirical performance frontier of open-source collaboration. The code is released at https://github.com/magent4aci/SMCS.
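The abstract does not say how RPS or EPE are implemented, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes that prior selection retrieves the most similar past inputs by cosine similarity and ranks models by their recorded scores on those inputs, and that the hybrid score is a weighted mix of answer agreement and a per-response quality score. The exploration step of EPE is not modelled. All function names, parameters (`k`, `top_n`, `alpha`), model names, and data are hypothetical.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def select_models(query_vec, history, k=3, top_n=2):
    """Assumed prior selection: retrieve the k most similar past inputs
    and return the top_n models with the highest summed scores on them.

    history: list of (embedding, {model_name: score}) pairs (hypothetical).
    """
    neighbours = sorted(
        history, key=lambda h: cosine(query_vec, h[0]), reverse=True
    )[:k]
    totals = defaultdict(float)
    for _, scores in neighbours:
        for model, score in scores.items():
            totals[model] += score
    # Sort by descending total; break ties by name for determinism.
    ranked = sorted(totals, key=lambda m: (-totals[m], m))
    return ranked[:top_n]


def hybrid_pick(responses, quality, alpha=0.5):
    """Assumed hybrid scoring: alpha * agreement + (1 - alpha) * quality.

    responses: {model_name: answer}; quality: {model_name: score in [0, 1]}.
    Agreement is the fraction of models that gave the same answer.
    """
    counts = defaultdict(int)
    for answer in responses.values():
        counts[answer] += 1
    n = len(responses)

    def score(model):
        agreement = counts[responses[model]] / n
        return alpha * agreement + (1 - alpha) * quality[model]

    best = max(sorted(responses), key=score)
    return responses[best]


# Toy data: two past inputs resemble the query, one does not.
history = [
    ([1.0, 0.0], {"model_a": 1.0, "model_b": 0.0, "model_c": 1.0}),
    ([0.9, 0.1], {"model_a": 1.0, "model_b": 0.0, "model_c": 0.0}),
    ([0.0, 1.0], {"model_a": 0.0, "model_b": 1.0, "model_c": 1.0}),
]
chosen = select_models([1.0, 0.05], history, k=2, top_n=2)

answer = hybrid_pick(
    {"model_a": "42", "model_b": "41", "model_c": "42"},
    {"model_a": 0.6, "model_b": 0.9, "model_c": 0.7},
)
```

In this toy setup, adding a new model only means appending its scores to `history`. That is one way a retrieval-based selector could scale without retraining, though the actual system may work differently.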
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Medical Question Answering | MedMCQA | Accuracy | 76.5 | 521 |
| Mathematical Reasoning | MATH 500 | Top-1 Accuracy | 94.5 | 384 |
| Reasoning | MMLU-Pro | Accuracy | 82.02 | 241 |
| Code Generation | HumanEval | Accuracy | 95.12 | 217 |
| Reasoning | GPQA Diamond | Accuracy | 65.15 | 185 |
| Scientific Question Answering | GPQA Diamond | Accuracy | 66.16 | 123 |
| Instruction Following | IFEval | Accuracy (IFEval) | 90 | 89 |
| Code Generation | LiveCodeBench | Accuracy | 52.17 | 84 |
| Mathematical Problem Solving | MATH500 | Accuracy | 92.6 | 83 |
| Multi-task Language Understanding | MMLU-Pro | Accuracy | 82.05 | 64 |