Determine-Then-Ensemble: Necessity of Top-k Union for Large Language Model Ensembling

About

Large language models (LLMs) exhibit varying strengths and weaknesses across different tasks, prompting recent studies to explore the benefits of ensembling models to leverage their complementary advantages. However, existing LLM ensembling methods often overlook model compatibility and struggle with inefficient alignment of probabilities across the entire vocabulary. In this study, we empirically investigate the factors influencing ensemble performance, identifying model performance, vocabulary size, and response style as key determinants, revealing that compatibility among models is essential for effective ensembling. This analysis leads to the development of a simple yet effective model selection strategy that identifies compatible models. Additionally, we introduce Union Top-k Ensembling (UniTE), a novel approach that efficiently combines models by focusing on the union of the top-k tokens from each model, thereby avoiding the need for full vocabulary alignment and reducing computational overhead. Extensive evaluations across multiple benchmarks demonstrate that UniTE significantly enhances performance compared to existing methods, offering a more efficient framework for LLM ensembling.
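The top-k union idea described in the abstract can be illustrated with a minimal sketch of one decoding step. This is not the authors' implementation. The abstract does not say how per-model probabilities are aligned or combined, so the choices here are assumptions: tokens are matched by their string form, a token absent from a model's vocabulary gets probability 0, each model's scores are renormalized over the union, and models are averaged with equal weight. The function names `top_k` and `union_top_k_step` and the toy distributions are hypothetical.

```python
from typing import Dict, List


def top_k(dist: Dict[str, float], k: int) -> List[str]:
    """Return the k most probable tokens of one model's next-token distribution."""
    return sorted(dist, key=dist.get, reverse=True)[:k]


def union_top_k_step(dists: List[Dict[str, float]], k: int) -> Dict[str, float]:
    """Combine next-token distributions over the union of each model's top-k tokens."""
    # Only the union of the per-model top-k sets is aligned, not the full vocabularies.
    union = set()
    for dist in dists:
        union.update(top_k(dist, k))

    combined = {tok: 0.0 for tok in union}
    for dist in dists:
        # Assumption: a token missing from this model's vocabulary scores 0.
        scores = {tok: dist.get(tok, 0.0) for tok in union}
        total = sum(scores.values())
        if total == 0.0:
            continue  # defensive guard; this model contributes nothing for this step
        for tok, p in scores.items():
            # Assumption: renormalize over the union, then average with equal weights.
            combined[tok] += p / total / len(dists)
    return combined


# Toy next-token distributions from two hypothetical models with different vocabularies.
model_a = {"4": 0.6, "four": 0.3, "5": 0.1}
model_b = {"4": 0.5, "5": 0.4, " four": 0.1}

combined = union_top_k_step([model_a, model_b], k=2)
next_token = max(combined, key=combined.get)
print(sorted(combined.items(), key=lambda kv: -kv[1]), next_token)
```

With k = 2, only three candidate tokens are scored at this step, however large either vocabulary is, which is where the reduced alignment cost mentioned in the abstract would come from.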

Yuxuan Yao, Han Wu, Mingyang Liu, Sichun Luo, Xiongwei Han, Jie Liu, Zhijiang Guo, Linqi Song • 2024

Related benchmarks

Task	Dataset	Result
Mathematical Reasoning	GSM8K	Accuracy 79.98	1424
Instruction Following	IFEval	--	854
Mathematical Reasoning	MATH	Accuracy 19.57	535
Code Generation	HumanEval	pass@1 61.59	329
Arithmetic Reasoning	GSM8K (test)	Accuracy 77.8	203
General Reasoning	BBH	BBH General Reasoning Accuracy 35.53	117
Reasoning	ARC-C	--	113
Language Understanding	MMLU	MMLU Score 71.94	98
Mathematical Reasoning	MAWPS (test)	Accuracy 92.8	87
Relation Extraction	CoNLL 04	F1 30.85	85

Showing 10 of 54 rows
