MultiAgentBench: Evaluating the Collaboration and Competition of LLM agents
About
Large Language Models (LLMs) have shown remarkable capabilities as autonomous agents, yet existing benchmarks either focus on single-agent tasks or are confined to narrow domains, failing to capture the dynamics of multi-agent coordination and competition. In this paper, we introduce MultiAgentBench, a comprehensive benchmark designed to evaluate LLM-based multi-agent systems across diverse, interactive scenarios. Our framework measures not only task completion but also the quality of collaboration and competition using novel, milestone-based key performance indicators. Moreover, we evaluate various coordination protocols (including star, chain, tree, and graph topologies) and innovative strategies such as group discussion and cognitive planning. Notably, gpt-4o-mini achieves the highest average task score, the graph structure performs best among coordination protocols in the research scenario, and cognitive planning improves milestone achievement rates by 3%. Code and datasets are publicly available at https://github.com/MultiagentBench/MARBLE.
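To make the four coordination topologies named in the abstract concrete, the following minimal sketch builds each one as a set of undirected communication links between agents. It is an illustration only and is not taken from the MARBLE codebase: the function name `build_topology`, the agent names, the choice of a binary tree for "tree", and the choice of a fully connected mesh for "graph" are all assumptions made for this example.

```python
from itertools import combinations


def build_topology(kind, agents):
    """Return a set of undirected links (frozensets of two agent names).

    Hypothetical helper for illustration; the shapes chosen for "tree"
    (binary tree) and "graph" (fully connected mesh) are assumptions.
    """
    if len(agents) < 2:
        return set()
    if kind == "star":
        # The first agent acts as a central hub linked to every other agent.
        hub = agents[0]
        edges = [(hub, a) for a in agents[1:]]
    elif kind == "chain":
        # Each agent is linked only to its successor in the list.
        edges = list(zip(agents, agents[1:]))
    elif kind == "tree":
        # Binary tree layout: the parent of agent i is agent (i - 1) // 2.
        edges = [(agents[(i - 1) // 2], agents[i]) for i in range(1, len(agents))]
    elif kind == "graph":
        # Every pair of agents can communicate directly.
        edges = list(combinations(agents, 2))
    else:
        raise ValueError(f"unknown topology: {kind}")
    return {frozenset(e) for e in edges}


agents = ["a0", "a1", "a2", "a3", "a4"]
for kind in ("star", "chain", "tree", "graph"):
    links = build_topology(kind, agents)
    print(kind, len(links))
```

With five agents, the star, chain, and tree layouts each use four links, while the fully connected mesh uses ten, which gives a rough sense of how much more communication a graph-style protocol can involve.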
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Multi-Agent System Performance | Database | Task Success Rate (TS) | 71.72 | 16 |
| Multi-Agent System Performance | Research | TS Score | 76.3 | 16 |
| Multi-Agent System Performance | Coding | TS Score | 63.05 | 16 |
| Collaborative software engineering | MultiAgentBench Coding Graph | Task Performance | 57.41 | 6 |
| Multi-agent Negotiation | MultiAgentBench Bargaining | Task Performance | 59.11 | 6 |
| Multi-agent interaction and social reasoning | Werewolf MultiAgentBench | Task Performance | 43.28 | 6 |
| Multi-agent research collaboration | MultiAgentBench Research | Task Performance | 69.83 | 6 |
| Collaborative software engineering | MultiAgentBench Coding (Tree) | Task Performance | 45.79 | 6 |