
Scoring, Reasoning, and Selecting the Best! Ensembling Large Language Models via a Peer-Review Process

About

We propose LLM-PeerReview, an unsupervised LLM ensemble method that selects the most suitable response from multiple LLM-generated candidates for each query, harnessing the collective wisdom of multiple models with diverse strengths. LLM-PeerReview is built on a novel, peer-review-inspired framework that offers a transparent and interpretable mechanism while remaining fully unsupervised for flexible adaptability and generalization. Specifically, it operates in three stages. For scoring, we use the emerging LLM-as-a-Judge technique to evaluate each response by reusing the multiple LLMs at hand. For reasoning, we apply either a straightforward averaging strategy or a principled graphical-model-based truth-inference algorithm to aggregate the multiple scores into a final score for each response. Finally, the highest-scoring response is selected as the ensemble output. LLM-PeerReview is conceptually simple and empirically powerful. Our results on four datasets show that the two variants of the proposed approach outperform the strong baseline Smoothie-Global by 6.9 and 7.3 percentage points, respectively, across diverse task types including factual recall QA, math reasoning, and instruction following.
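The three stages described above (scoring, reasoning, selecting) can be sketched in a few lines of Python. This is a minimal illustration under assumptions, not the paper's implementation: the function name `peer_review_select` and the toy judge functions are hypothetical, and only the simpler averaging variant of the reasoning stage is shown (the graphical-model truth-inference aggregator is omitted).

```python
# Minimal sketch of the peer-review-style ensemble selection described above.
# Each "judge" stands in for an LLM-as-a-Judge call that returns a quality
# score for a candidate response; real judges would query the LLMs themselves.

from statistics import mean

def peer_review_select(responses, judges):
    """Select the best candidate response by averaging peer-review scores.

    responses: list of candidate answer strings (one per model).
    judges: list of scoring functions, each mapping a response to a number.
    """
    # Stage 1 (scoring): every judge scores every candidate response.
    score_matrix = [[judge(r) for judge in judges] for r in responses]
    # Stage 2 (reasoning): aggregate the scores for each response; the
    # simpler variant uses plain averaging.
    final_scores = [mean(row) for row in score_matrix]
    # Stage 3 (selecting): return the highest-scoring candidate.
    best = max(range(len(responses)), key=lambda i: final_scores[i])
    return responses[best], final_scores

# Toy demo: trivial scoring functions stand in for LLM judges.
responses = ["Paris", "The capital of France is Paris.", "France"]
judges = [lambda r: len(r), lambda r: float(r.count("Paris"))]
best, scores = peer_review_select(responses, judges)
```

Swapping the `mean` aggregation for a weighted or truth-inference scheme changes only the reasoning stage; the scoring and selection stages stay the same.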

Zhijun Chen, Zeyu Ji, Qianren Mao, Hao Wu, Junhang Cheng, Bangjie Qin, Zhuoran Li, Jingzheng Li, Kai Sun, Zizhe Wang, Yikun Ban, Zhu Sun, Xiangyang Ji, Hailong Sun• 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Arithmetic Reasoning | GSM8K | Accuracy | 93 | 155 |
| Instruction Following | AlpacaEval | Win Rate | 47.3 | 125 |
| Arithmetic Reasoning | MATH | Accuracy | 71 | 16 |
| Factual Recall | TriviaQA | Accuracy | 77 | 16 |
| Question Answering | TriviaQA | TriviaQA Score | 77.3 | 11 |
