Towards Reasoning in Large Language Models via Multi-Agent Peer Review Collaboration

About

Large Language Models (LLMs) have shown remarkable capabilities in general natural language processing tasks but often fall short in complex reasoning tasks. Recent studies have explored human-like problem-solving strategies, such as self-correction, to push the boundary of single-model reasoning ability further. In this work, we let a single model "step outside the box" by engaging multiple models to correct each other. We introduce a multi-agent collaboration strategy that emulates the academic peer review process. Each agent independently constructs its own solution, reviews the solutions of the others, and assigns confidence levels to its reviews. Upon receiving peer reviews, agents revise their initial solutions. Extensive experiments on three different types of reasoning tasks show that our collaboration approach delivers superior accuracy across all ten datasets compared to existing methods. Further study underscores the effectiveness of integrating confidence in reviews, demonstrates the superiority of feedback exchange over mere solution sharing, and highlights the role of capability and diversity in fostering successful collaboration.
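The three stages described in the abstract (independent solution, peer review with confidence, revision) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Agent` interface, the `Review` record, the integer confidence scale, and the toy `StubAgent` revision policy are all hypothetical names and choices introduced here. A real system would back each method with an LLM call.

```python
from dataclasses import dataclass
from typing import Dict, List, Protocol


@dataclass
class Review:
    reviewer: str
    feedback: str      # in this toy sketch: the answer the reviewer believes is right
    confidence: int    # hypothetical integer scale; the abstract only says "confidence levels"


class Agent(Protocol):
    """Hypothetical interface; each method would wrap a model call in practice."""
    def solve(self, question: str) -> str: ...
    def review(self, question: str, solution: str) -> Review: ...
    def revise(self, question: str, draft: str, reviews: List[Review]) -> str: ...


def peer_review_round(question: str, agents: Dict[str, Agent]) -> Dict[str, str]:
    # Stage 1: every agent drafts a solution independently.
    drafts = {name: agent.solve(question) for name, agent in agents.items()}

    # Stage 2: every agent reviews each peer's draft (never its own)
    # and attaches a confidence level to the review.
    inbox: Dict[str, List[Review]] = {name: [] for name in agents}
    for reviewer_name, reviewer in agents.items():
        for author_name, draft in drafts.items():
            if author_name != reviewer_name:
                inbox[author_name].append(reviewer.review(question, draft))

    # Stage 3: every agent revises its own draft using the reviews it received.
    return {
        name: agent.revise(question, drafts[name], inbox[name])
        for name, agent in agents.items()
    }


class StubAgent:
    """Deterministic stand-in for an LLM-backed agent (assumption, for illustration)."""

    def __init__(self, name: str, answer: str, confidence: int) -> None:
        self.name = name
        self.answer = answer
        self.confidence = confidence

    def solve(self, question: str) -> str:
        return self.answer

    def review(self, question: str, solution: str) -> Review:
        # The stub simply reports its own answer as feedback.
        return Review(self.name, self.answer, self.confidence)

    def revise(self, question: str, draft: str, reviews: List[Review]) -> str:
        # Toy policy: adopt the most confident dissenting review,
        # but only if that reviewer is more confident than this agent.
        dissent = [r for r in reviews if r.feedback != draft]
        if not dissent:
            return draft
        strongest = max(dissent, key=lambda r: r.confidence)
        return strongest.feedback if strongest.confidence > self.confidence else draft
```

With three stub agents where one low-confidence agent starts from a different answer, a single round is enough for that agent to adopt its peers' answer, while the confident agents keep theirs. The confidence-weighted revision rule here is only one possible policy; the abstract does not specify how agents weigh the reviews they receive.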

Zhenran Xu, Senbao Shi, Baotian Hu, Jindi Yu, Dongfang Li, Min Zhang, Yuxiang Wu • 2023

Related benchmarks

Task	Dataset	Result
Question Answering	ARC Challenge	--	906
Mathematical Reasoning	MATH	Accuracy 51.2	882
Long-context Language Understanding	LongBench	M-Avg 50.21	294
Science Question Answering	ARC-C	--	268
Graduate-level Question Answering	GPQA	Accuracy 32.4	224
Question Answering	SQuAD	Exact Match 87.67	83
Language Understanding	MMLU	RA 77.33	31
Long-context Understanding	LongBench	Average Context Length (tokens) 8.15e+5	16
Mathematical Reasoning	MATH	Avg Context Length (tokens) 8.85e+3	16
Multi-task Language Understanding	MMLU-Pro	Average Context Length (tokens) 1.49e+4	16

Showing 10 of 16 rows
