Scoring, Reasoning, and Selecting the Best! Ensembling Large Language Models via a Peer-Review Process
About
We propose LLM-PeerReview, an unsupervised LLM ensemble method that selects the best response from multiple LLM-generated candidates for each query, harnessing the collective wisdom of multiple models with diverse strengths. LLM-PeerReview is built on a novel, peer-review-inspired framework that offers a transparent and interpretable mechanism while remaining fully unsupervised, allowing flexible adaptation and generalization. It operates in three stages: for scoring, we use the emerging LLM-as-a-Judge technique to evaluate each response by reusing the multiple LLMs at hand; for reasoning, we apply either a straightforward averaging strategy or a principled graphical-model-based truth-inference algorithm to aggregate the multiple scores into a final score for each response; finally, the highest-scoring response is selected as the ensemble output. LLM-PeerReview is conceptually simple and empirically powerful: across four datasets spanning diverse task types, including factual recall QA, math reasoning, and instruction following, the two variants of the proposed approach outperform the advanced Smoothie-Global model by 6.9 and 7.3 percentage points.
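The three stages above (score, aggregate, select) can be sketched as follows. This is a minimal illustration of the averaging variant, not the released implementation: `judge_score` is a hypothetical stand-in for an actual LLM-as-a-Judge call, stubbed here with fixed scores so the sketch is runnable.

```python
def judge_score(judge, query, response):
    # Hypothetical stub: in practice, prompt the `judge` model to rate
    # `response` for `query` (e.g. on a 1-10 scale) and parse its output.
    stub = {("judge-A", "r1"): 6, ("judge-A", "r2"): 9,
            ("judge-B", "r1"): 5, ("judge-B", "r2"): 8}
    return stub[(judge, response)]

def peer_review_select(query, responses, judges):
    # Stage 1 (scoring): every judge model scores every candidate response.
    scores = {r: [judge_score(j, query, r) for j in judges] for r in responses}
    # Stage 2 (reasoning): aggregate each response's scores; the simple
    # variant averages, the other variant would run truth inference instead.
    final = {r: sum(s) / len(s) for r, s in scores.items()}
    # Stage 3 (selection): the highest-scoring response is the ensemble output.
    return max(final, key=final.get)

best = peer_review_select("q", ["r1", "r2"], ["judge-A", "judge-B"])
print(best)  # -> r2 (average 8.5 beats r1's 5.5)
```

Swapping the averaging line for a Dawid-Skene-style truth-inference step, which weights judges by estimated reliability, yields the second variant described above.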
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Arithmetic Reasoning | GSM8K | Accuracy | 93 | 155 |
| Instruction Following | AlpacaEval | Win Rate | 47.3 | 125 |
| Arithmetic Reasoning | MATH | Accuracy | 71 | 16 |
| Factual Recall | TriviaQA | Accuracy | 77 | 16 |
| Question Answering | TriviaQA | TriviaQA Score | 77.3 | 11 |