
MultiAgentBench: Evaluating the Collaboration and Competition of LLM agents

About

Large Language Models (LLMs) have shown remarkable capabilities as autonomous agents, yet existing benchmarks either focus on single-agent tasks or are confined to narrow domains, failing to capture the dynamics of multi-agent coordination and competition. In this paper, we introduce MultiAgentBench, a comprehensive benchmark designed to evaluate LLM-based multi-agent systems across diverse, interactive scenarios. Our framework measures not only task completion but also the quality of collaboration and competition using novel, milestone-based key performance indicators. Moreover, we evaluate various coordination protocols (including star, chain, tree, and graph topologies) and innovative strategies such as group discussion and cognitive planning. Notably, gpt-4o-mini achieves the highest average task score, the graph structure performs best among coordination protocols in the research scenario, and cognitive planning improves milestone achievement rates by 3%. Code and datasets are publicly available at https://github.com/MultiagentBench/MARBLE.
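To make the four coordination topologies named in the abstract concrete, here is a minimal sketch that builds each one as an undirected communication map between agents. It is an illustration under stated assumptions, not the MARBLE implementation: the function names, the choice of the first agent as hub or root, the binary branching factor for the tree, and the reading of "graph" as a full mesh are all hypothetical.

```python
from typing import Dict, List, Set

# Hypothetical representation: agent name -> set of agents it can message.
Topology = Dict[str, Set[str]]


def _empty(agents: List[str]) -> Topology:
    """Start with every agent present and no links."""
    return {agent: set() for agent in agents}


def _link(topo: Topology, a: str, b: str) -> None:
    """Add an undirected communication link between a and b."""
    topo[a].add(b)
    topo[b].add(a)


def star(agents: List[str]) -> Topology:
    """Assumption: the first agent is the hub; all others talk only to it."""
    topo = _empty(agents)
    for other in agents[1:]:
        _link(topo, agents[0], other)
    return topo


def chain(agents: List[str]) -> Topology:
    """Each agent is linked to its predecessor and successor in list order."""
    topo = _empty(agents)
    for left, right in zip(agents, agents[1:]):
        _link(topo, left, right)
    return topo


def tree(agents: List[str], branching: int = 2) -> Topology:
    """Assumption: a complete tree in list order, rooted at the first agent."""
    topo = _empty(agents)
    for i in range(1, len(agents)):
        parent = agents[(i - 1) // branching]
        _link(topo, parent, agents[i])
    return topo


def graph(agents: List[str]) -> Topology:
    """Assumption: 'graph' is read here as a full mesh (every pair linked)."""
    topo = _empty(agents)
    for i, a in enumerate(agents):
        for b in agents[i + 1:]:
            _link(topo, a, b)
    return topo


def edge_count(topo: Topology) -> int:
    """Number of undirected links (each link is stored in both directions)."""
    return sum(len(neighbors) for neighbors in topo.values()) // 2
```

With four agents, the star, chain, and tree layouts each use three links while the full mesh uses six, which gives a rough sense of how much more communication a graph-style protocol permits.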

Kunlun Zhu, Hongyi Du, Zhaochen Hong, Xiaocheng Yang, Shuyi Guo, Zhe Wang, Zhenhailong Wang, Cheng Qian, Xiangru Tang, Heng Ji, Jiaxuan You • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Multi-Agent System Performance | Database | Task Success Rate (TS): 71.72 | 16 |
| Multi-Agent System Performance | Research | TS Score: 76.3 | 16 |
| Multi-Agent System Performance | Coding | TS Score: 63.05 | 16 |
| Collaborative software engineering | MultiAgentBench Coding Graph | Task Performance: 57.41 | 6 |
| Multi-agent Negotiation | MultiAgentBench Bargaining | Task Performance: 59.11 | 6 |
| Multi-agent interaction and social reasoning | Werewolf MultiAgentBench | Task Performance: 43.28 | 6 |
| Multi-agent research collaboration | MultiAgentBench Research | Task Performance: 69.83 | 6 |
| Collaborative software engineering | MultiAgentBench Coding (Tree) | Task Performance: 45.79 | 6 |
