
MARS: toward more efficient multi-agent collaboration for LLM reasoning

About

Large language models (LLMs) have achieved impressive results in natural language understanding, yet their reasoning capabilities remain limited when operating as single agents. Multi-Agent Debate (MAD) has been proposed to address this limitation by enabling collaborative reasoning among multiple models in a round-table debate manner. While effective, MAD introduces substantial computational overhead due to the number of agents involved and the frequent communication required. In this paper, we propose MARS (Multi-Agent Review System), a role-based collaboration framework inspired by the academic peer-review process. In MARS, an author agent generates an initial solution, reviewer agents provide decisions and comments independently, and a meta-reviewer integrates the feedback to make the final decision and guide further revision. This design enhances reasoning quality while avoiding costly reviewer-to-reviewer interactions, thereby controlling token consumption and inference time. We compare MARS with both MAD and other state-of-the-art reasoning strategies across multiple benchmarks. Extensive experiments with different LLMs show that MARS matches the accuracy of MAD while reducing both token usage and inference time by approximately 50%. Code is available at https://github.com/xwang97/MARS.
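The author/reviewer/meta-reviewer loop described above can be sketched as follows. This is a minimal illustration assuming a simple accept-or-revise protocol and generic agent callables; the paper's actual prompts, decision format, and stopping rule may differ (see the linked repository for the real implementation).

```python
# Hypothetical sketch of the MARS review loop: an author agent drafts a
# solution, reviewer agents judge it independently, and a meta-reviewer
# aggregates their feedback to accept or request a revision.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Review:
    accept: bool   # reviewer's (or meta-reviewer's) decision
    comment: str   # feedback used to guide the next revision

def mars_solve(
    question: str,
    author: Callable[[str, str], str],              # (question, feedback) -> solution
    reviewers: List[Callable[[str, str], Review]],  # (question, solution) -> Review
    meta: Callable[[List[Review]], Review],         # integrates reviews into one decision
    max_rounds: int = 3,
) -> str:
    solution = author(question, "")
    for _ in range(max_rounds):
        # Reviewers evaluate the solution independently; there is no
        # reviewer-to-reviewer exchange, which is what keeps token usage
        # below round-table debate.
        reviews = [review(question, solution) for review in reviewers]
        decision = meta(reviews)
        if decision.accept:
            break
        # Revise the solution using the meta-reviewer's consolidated feedback.
        solution = author(question, decision.comment)
    return solution
```

In this sketch each agent callable would wrap an LLM call; the meta-reviewer could be as simple as a majority vote over reviewer decisions plus a summary of their comments.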

Xiao Wang, Jia Wang, Yijie Wang, Pengtao Dang, Sha Cao, Chi Zhang • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Mathematical Reasoning | GSM8K | Accuracy | 98 | 499 |
| Algebraic Reasoning | AQUA | Accuracy | 83.07 | 61 |
| Graduate-Level Reasoning | GPQA | Accuracy | 49.49 | 41 |
| Multitask Language Understanding | MMLU | Accuracy | 78.63 | 34 |
| Question Answering | GPQA | Accuracy | 60 | 30 |
| Question Answering | MMLU | Accuracy | 85.67 | 30 |
| Scientific Question Answering | GPQA | Average Inference Time (s) | 9.54 | 30 |
| Multi-task Language Understanding | MMLU | Average Inference Time (s) | 7.61 | 30 |
| Mathematical Reasoning | GSM8K | Average Inference Time (s) | 7.17 | 30 |
| Aggregate Reasoning Evaluation | Multi-dataset Reasoning Suite | Average Accuracy | 77.55 | 12 |

Showing 10 of 11 rows.
