Beyond Majority Voting: LLM Aggregation by Leveraging Higher-Order Information

About

With the rapid progress of multi-agent large language model (LLM) reasoning, how to effectively aggregate answers from multiple LLMs has emerged as a fundamental challenge. Standard majority voting treats all answers equally, failing to consider latent heterogeneity and correlation across models. In this work, we design two new aggregation algorithms, Optimal Weight (OW) and Inverse Surprising Popularity (ISP), which leverage both first-order and second-order information. Our theoretical analysis shows that these methods provably mitigate inherent limitations of majority voting under mild assumptions, leading to more reliable collective decisions. We empirically validate our algorithms on synthetic datasets, on popular LLM fine-tuning benchmarks such as UltraFeedback and MMLU, and in a real-world healthcare setting, ARMMAN. Our algorithms consistently outperform standard baselines, establishing a robust, training-free framework for effective multi-agent LLM aggregation.
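To make the limitation of majority voting concrete, here is a minimal sketch contrasting it with a weighted vote. It is not the paper's OW or ISP algorithm: the weights are the classical log-odds weights, which are known to be optimal for independent voters on binary questions, and the per-model accuracies are hypothetical values assumed to be known in advance. The sketch also ignores correlation between models, which is the second-order information the abstract refers to.

```python
import math
from collections import Counter, defaultdict


def majority_vote(answers):
    """Return the most common answer; every model counts equally."""
    return Counter(answers).most_common(1)[0][0]


def weighted_vote(answers, accuracies):
    """Weighted plurality vote with log-odds weights.

    Illustrative only: the accuracies are assumed known and strictly
    between 0 and 1, and the models are treated as independent.
    """
    scores = defaultdict(float)
    for answer, acc in zip(answers, accuracies):
        scores[answer] += math.log(acc / (1.0 - acc))
    return max(scores, key=scores.get)


# Hypothetical example: two weak models agree, one strong model disagrees.
answers = ["A", "A", "B"]
accuracies = [0.55, 0.55, 0.90]

print(majority_vote(answers))              # the two weak models win
print(weighted_vote(answers, accuracies))  # the strong model outweighs them
```

In this toy case the two 55%-accurate models outvote the 90%-accurate one under majority voting, while the weighted vote sides with the stronger model. This is the kind of heterogeneity that equal-weight voting cannot account for.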

Rui Ai, Yuqi Pan, David Simchi-Levi, Milind Tambe, Haifeng Xu • 2025

Related benchmarks

Task	Dataset	Result
Medical Question Answering	MedMCQA	Accuracy 69.67	591
Multiple-choice Question Answering	HellaSwag	Accuracy 82.67	212
Question Answering	MedMCQA	Accuracy 63.67	125
Toxicity Detection	Toxicity Detection 64 model-persona combinations (8 models x 8 personas)	Win Count 50	56
Question Answering	CSQA	Accuracy 86	42
Question Answering	MMLU Pro.Med.	Accuracy 92.28	42
Question Answering	HH-RLHF	Accuracy 56.67	22
Question Answering	MMLU Formal Logic (test)	Accuracy 64.29	22
Multi-Agent Reasoning	UltraFeedback	Accuracy 73.66	9
Multi-Agent Reasoning	ARMMAN	Accuracy 85.78	9

Showing 10 of 11 rows
