The Avengers: A Simple Recipe for Uniting Smaller Language Models to Challenge Proprietary Giants

About

Proprietary giants are increasingly dominating the race for ever-larger language models. Can open-source, smaller models remain competitive across a broad range of tasks? In this paper, we present the Avengers -- a simple recipe that leverages the collective intelligence of these smaller models. The Avengers builds upon four lightweight operations: (i) embedding: encode queries using a text embedding model; (ii) clustering: group queries based on their semantic similarity; (iii) scoring: score each model's performance within each cluster; and (iv) voting: improve outputs via repeated sampling and voting. At inference time, each query is embedded and assigned to its nearest cluster. The top-performing model(s) within that cluster are selected to generate the response with repeated sampling. Remarkably, with 10 open-source models (~7B parameters each), the Avengers surpasses GPT-4o, GPT-4.1, and GPT-4.5 in average performance across 15 diverse datasets spanning mathematics, coding, logical reasoning, general knowledge, and affective tasks. In particular, it surpasses GPT-4.1 on mathematics tasks by 18.21% and on code tasks by 7.46%. Furthermore, the Avengers delivers superior out-of-distribution generalization and remains robust across various embedding models, clustering algorithms, ensemble strategies, and values of its sole parameter -- the number of clusters.
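
The four operations can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes query embeddings are already computed by some text embedding model, uses a small hand-written k-means as one possible clustering algorithm, takes mean accuracy on a 0/1 correctness matrix as the per-cluster score, and uses plain majority voting over sampled answers. All function names (`nearest`, `kmeans`, `score_models`, `route`, `vote`) and the toy data are hypothetical.

```python
from collections import Counter

import numpy as np


def nearest(points, centroids):
    """Index of the nearest centroid (squared Euclidean) for each point."""
    d = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return d.argmin(axis=1)


def kmeans(X, k, iters=20, seed=0):
    """Plain Lloyd's k-means; stands in for any clustering algorithm."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(iters):
        labels = nearest(X, centroids)
        for j in range(k):
            if (labels == j).any():  # leave empty clusters where they are
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids, nearest(X, centroids)


def score_models(labels, correct, k):
    """Mean accuracy of each model within each cluster.

    correct[i, m] is 1 if model m answered query i correctly (assumed given).
    """
    scores = np.zeros((k, correct.shape[1]))
    for j in range(k):
        mask = labels == j
        if mask.any():
            scores[j] = correct[mask].mean(axis=0)
    return scores


def route(query_emb, centroids, scores):
    """Assign the query to its nearest cluster and pick that cluster's best model."""
    cluster = int(nearest(query_emb[None, :], centroids)[0])
    return int(scores[cluster].argmax())


def vote(answers):
    """Majority vote over repeated samples from the selected model."""
    return Counter(answers).most_common(1)[0][0]


# Toy data: two well-separated groups of 2-D "embeddings" and two models,
# where model 0 is right on the first group and model 1 on the second.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
              [10, 10], [10, 11], [11, 10], [11, 11]], dtype=float)
correct = np.array([[1, 0]] * 4 + [[0, 1]] * 4, dtype=float)

centroids, labels = kmeans(X, k=2)
scores = score_models(labels, correct, k=2)

model_a = route(np.array([0.5, 0.5]), centroids, scores)    # query near group 1
model_b = route(np.array([10.5, 10.5]), centroids, scores)  # query near group 2
answer = vote(["42", "41", "42"])  # stand-in for repeated sampling
```

In this sketch the only tunable value is `k`, mirroring the abstract's statement that the number of clusters is the recipe's sole parameter; selecting several top models per cluster instead of one would be a small change to `route`.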

Yiqun Zhang, Hao Li, Chenxu Wang, Linyao Chen, Qiaosheng Zhang, Peng Ye, Shi Feng, Daling Wang, Zhen Wang, Xinrun Wang, Jia Xu, Lei Bai, Wanli Ouyang, Shuyue Hu • 2025

Related benchmarks

Task	Dataset	Metric	Result	
Multi-turn dialogue	ShareGPT	Success Rate (SR)	83.93	24
Multi-turn dialogue	ShareGPT, JDDC, and MedDG Aggregated	SRavg	78	24
Multi-turn dialogue	JDDC	Success Rate (SR)	78.39	24
Multi-turn dialogue	MedDG	Success Rate (SR)	71.69	24
Interactive Decision-making	ScienceWorld (test)	Score	36.8	14
LLM Routing	MedMCQA	Top-1 Acc	84.8	14
LLM Routing	MedMCQA (val)	Top-1 Acc	92.7	14
Interactive Decision-making	ScienceWorld (OOD)	Score	2.4	14
LLM Routing	MMLU-Pro	Top-1 Acc	78.6	14
LLM Routing	SuperGPQA	Top-1 Acc	51.7	14

Showing 10 of 21 rows
