MultiAgentBench: Evaluating the Collaboration and Competition of LLM agents

About

Large Language Models (LLMs) have shown remarkable capabilities as autonomous agents, yet existing benchmarks either focus on single-agent tasks or are confined to narrow domains, failing to capture the dynamics of multi-agent coordination and competition. In this paper, we introduce MultiAgentBench, a comprehensive benchmark designed to evaluate LLM-based multi-agent systems across diverse, interactive scenarios. Our framework measures not only task completion but also the quality of collaboration and competition using novel, milestone-based key performance indicators. Moreover, we evaluate various coordination protocols (including star, chain, tree, and graph topologies) and innovative strategies such as group discussion and cognitive planning. Notably, gpt-4o-mini achieves the highest average task score, the graph structure performs best among coordination protocols in the research scenario, and cognitive planning improves milestone achievement rates by 3%. Code and datasets are publicly available at https://github.com/MultiagentBench/MARBLE.
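To make the four coordination topologies named above concrete, here is a minimal sketch that represents each one as a set of communication links between agents. It is illustrative only and is not taken from the MARBLE codebase: the function names, the choice of agent 0 as hub or root, the binary branching factor for the tree, and the reading of "graph" as fully connected are all assumptions. The `milestone_rate` helper is likewise a hypothetical simplification of a milestone-based indicator, not the paper's exact definition.

```python
from itertools import combinations


def star(n):
    """Assumed layout: agent 0 is the hub and talks to every other agent."""
    return {(0, i) for i in range(1, n)}


def chain(n):
    """Each agent passes messages to the next agent in line."""
    return {(i, i + 1) for i in range(n - 1)}


def tree(n, branching=2):
    """Assumed layout: agent 0 is the root; each parent has up to `branching` children."""
    return {((i - 1) // branching, i) for i in range(1, n)}


def graph(n):
    """Assumed reading of 'graph': every pair of agents can communicate."""
    return set(combinations(range(n), 2))


def milestone_rate(achieved, total):
    """Hypothetical indicator: fraction of task milestones that were reached."""
    if total <= 0:
        raise ValueError("total must be positive")
    return achieved / total


if __name__ == "__main__":
    for name, build in [("star", star), ("chain", chain), ("tree", tree), ("graph", graph)]:
        print(name, sorted(build(4)))
    print("milestone rate:", milestone_rate(3, 4))
```

With four agents, the star, chain, and tree layouts each use three links while the fully connected graph uses six, which shows how the topology alone changes the amount of possible agent-to-agent communication.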

Kunlun Zhu, Hongyi Du, Zhaochen Hong, Xiaocheng Yang, Shuyi Guo, Zhe Wang, Zhenhailong Wang, Cheng Qian, Xiangru Tang, Heng Ji, Jiaxuan You • 2025

Related benchmarks

Task	Dataset	Metric	Result	
Multi-Agent System Performance	Database	Task Success Rate (TS)	71.72	16
Multi-Agent System Performance	Research	TS Score	76.3	16
Multi-Agent System Performance	Coding	TS Score	63.05	16
Collaborative software engineering	MultiAgentBench Coding Graph	Task Performance	57.41	6
Multi-agent Negotiation	MultiAgentBench Bargaining	Task Performance	59.11	6
Multi-agent interaction and social reasoning	Werewolf MultiAgentBench	Task Performance	43.28	6
Multi-agent research collaboration	MultiAgentBench Research	Task Performance	69.83	6
Collaborative software engineering	MultiAgentBench Coding (Tree)	Task Performance	45.79	6
