MARS: Toward More Efficient Multi-Agent Collaboration for LLM Reasoning
About
Large language models (LLMs) have achieved impressive results in natural language understanding, yet their reasoning capabilities remain limited when operating as single agents. Multi-Agent Debate (MAD) has been proposed to address this limitation by enabling collaborative reasoning among multiple models in a round-table debate manner. While effective, MAD introduces substantial computational overhead due to the number of agents involved and the frequent communication required. In this paper, we propose MARS (Multi-Agent Review System), a role-based collaboration framework inspired by the peer-review process. In MARS, an author agent generates an initial solution, reviewer agents independently provide decisions and comments, and a meta-reviewer integrates the feedback to make the final decision and guide further revision. This design enhances reasoning quality while avoiding costly reviewer-to-reviewer interactions, thereby controlling token consumption and inference time. We compare MARS with both MAD and other state-of-the-art reasoning strategies across multiple benchmarks. Extensive experiments with different LLMs show that MARS matches the accuracy of MAD while reducing both token usage and inference time by approximately 50%. Code is available at https://github.com/xwang97/MARS.
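The review loop described above can be sketched as follows. This is a minimal illustration, not the repository's actual implementation: the agents are stand-in text-in/text-out functions, and the prompt formats, `mars` function name, and accept/revise protocol are all assumptions made for the sketch.

```python
from typing import Callable, List

# Any text-in/text-out model call; stubbed below for illustration.
Agent = Callable[[str], str]

def mars(question: str, author: Agent, reviewers: List[Agent],
         meta_reviewer: Agent, max_rounds: int = 3) -> str:
    """Sketch of the MARS loop: an author drafts a solution, each reviewer
    critiques it independently (no reviewer-to-reviewer exchange), and a
    meta-reviewer aggregates the reviews into an accept/revise verdict."""
    solution = author(question)
    for _ in range(max_rounds):
        # Reviewers see only the question and the current solution,
        # never each other's reviews -- this is what cuts communication cost.
        reviews = [r(f"Question: {question}\nSolution: {solution}")
                   for r in reviewers]
        verdict = meta_reviewer("\n".join(reviews))
        if verdict.lower().startswith("accept"):
            break
        # The author revises using only the meta-reviewer's consolidated feedback.
        solution = author(f"{question}\nFeedback: {verdict}")
    return solution

# Toy stub agents standing in for real LLM calls.
author = lambda p: "42" if "Feedback" in p else "41"
reviewers = [lambda p: "revise: off by one" if "41" in p else "accept"] * 2
meta = lambda rs: "accept" if "revise" not in rs else "revise: fix arithmetic"

result = mars("What is 6 * 7?", author, reviewers, meta)
print(result)  # → 42
```

Note that each reviewer's cost is linear in the number of reviewers, whereas round-table debate requires each agent to read every other agent's output each round, which is where the reported ~50% savings in tokens and time would come from.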
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | GSM8K | Accuracy | 98 | 499 |
| Algebraic Reasoning | AQUA | Accuracy | 83.07 | 61 |
| Graduate-Level Reasoning | GPQA | Accuracy | 49.49 | 41 |
| Multitask Language Understanding | MMLU | Accuracy | 78.63 | 34 |
| Question Answering | GPQA | Accuracy | 60 | 30 |
| Question Answering | MMLU | Accuracy | 85.67 | 30 |
| Scientific Question Answering | GPQA | Average Inference Time (s) | 9.54 | 30 |
| Multi-task Language Understanding | MMLU | Average Inference Time (s) | 7.61 | 30 |
| Mathematical Reasoning | GSM8K | Average Inference Time (s) | 7.17 | 30 |
| Aggregate Reasoning Evaluation | Multi-dataset Reasoning Suite | Average Accuracy | 77.55 | 12 |