Scoring, Reasoning, and Selecting the Best! Ensembling Large Language Models via a Peer-Review Process
About
We propose LLM-PeerReview, an unsupervised LLM ensemble method that selects the best response from multiple LLM-generated candidates for each query, harnessing the collective wisdom of multiple models with diverse strengths. LLM-PeerReview is built on a novel, peer-review-inspired framework that offers a transparent and interpretable mechanism while remaining fully unsupervised, allowing flexible adaptation and generalization. It operates in three stages: for scoring, we use the emerging LLM-as-a-Judge technique to evaluate each response by reusing the multiple LLMs at hand; for reasoning, we apply either a straightforward averaging strategy or a principled graphical-model-based truth-inference algorithm to aggregate the multiple scores into a final score for each response; finally, the highest-scoring response is selected as the ensemble output. LLM-PeerReview is conceptually simple and empirically powerful: across four datasets spanning diverse task types, including factual recall QA, math reasoning, and instruction following, the two variants of the proposed approach outperform the advanced Smoothie-Global model by 6.9 and 7.3 percentage points.
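The three stages above (score, aggregate, select) can be sketched as follows. This is a minimal illustration of the averaging variant, not the released implementation: `judge_score` is a hypothetical stand-in for an actual LLM-as-a-Judge call, stubbed here with fixed scores so the sketch is runnable.

```python
def judge_score(judge, query, response):
    # Hypothetical stub: in practice, prompt the `judge` model to rate
    # `response` for `query` (e.g. on a 1-10 scale) and parse its output.
    stub = {("judge-A", "r1"): 6, ("judge-A", "r2"): 9,
            ("judge-B", "r1"): 5, ("judge-B", "r2"): 8}
    return stub[(judge, response)]

def peer_review_select(query, responses, judges):
    # Stage 1 (scoring): every judge model scores every candidate response.
    scores = {r: [judge_score(j, query, r) for j in judges] for r in responses}
    # Stage 2 (reasoning): aggregate each response's scores; the simple
    # variant averages, the other variant would run truth inference instead.
    final = {r: sum(s) / len(s) for r, s in scores.items()}
    # Stage 3 (selection): the highest-scoring response is the ensemble output.
    return max(final, key=final.get)

best = peer_review_select("q", ["r1", "r2"], ["judge-A", "judge-B"])
print(best)  # -> r2 (average 8.5 beats r1's 5.5)
```

Swapping the averaging line for a Dawid-Skene-style truth-inference step, which weights judges by estimated reliability, yields the second variant described above.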
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Arithmetic Reasoning | GSM8K | Accuracy | 93 | 155 |
| Instruction Following | AlpacaEval | Win Rate | 47.3 | 125 |
| Arithmetic Reasoning | MATH | Accuracy | 71 | 16 |
| Factual Recall | TriviaQA | Accuracy | 77 | 16 |
| Question Answering | TriviaQA | TriviaQA Score | 77.3 | 11 |