
MAS-GPT: Training LLMs to Build LLM-based Multi-Agent Systems

About

LLM-based multi-agent systems (MAS) have shown significant potential in tackling diverse tasks. However, designing an effective MAS with existing approaches relies heavily on manual configuration or on multiple calls to advanced LLMs, which limits adaptability and incurs high inference costs. In this paper, we simplify the process of building an MAS by reframing it as a generative language task, where the input is a user query and the output is a corresponding MAS. To address this novel task, we unify the representation of an MAS as executable code and propose a consistency-oriented data construction pipeline to create a high-quality dataset of coherent and consistent query-MAS pairs. Using this dataset, we train MAS-GPT, an open-source medium-sized LLM capable of generating a query-adaptive MAS within a single LLM inference. The generated MAS can be seamlessly applied to process the user query and deliver a high-quality response. Extensive experiments on 9 benchmarks and 5 LLMs show that MAS-GPT consistently outperforms 10+ baseline MAS methods across diverse settings, indicating its effectiveness, efficiency, and strong generalization ability. Code will be available at https://github.com/rui-ye/MAS-GPT.
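The workflow described above can be sketched in a few lines of Python: a single model inference maps a user query to an MAS represented as executable code, and that code is then executed on the same query. Note that the function names (`generate_mas_code`, `run_mas`) and the canned two-agent template below are purely illustrative assumptions, not MAS-GPT's actual API; the trained model call is stubbed out.

```python
# Hedged sketch of the query -> MAS -> response pipeline from the abstract.
# All names are hypothetical; the "LLM inference" is stubbed with a fixed
# two-agent (solver + reviewer) template rather than a real model call.

def generate_mas_code(query: str) -> str:
    """Stand-in for a single MAS-GPT inference: return MAS source code."""
    # A real system would generate query-adaptive code with the trained
    # model here; we return a fixed pipeline for illustration.
    return '''
def solve(query):
    # Agent 1: draft an answer (stubbed).
    draft = f"draft answer to: {query}"
    # Agent 2: review and finalize the draft (stubbed).
    return f"reviewed({draft})"
'''

def run_mas(mas_code: str, query: str) -> str:
    """Execute the generated MAS code, then apply it to the user query."""
    namespace = {}
    exec(mas_code, namespace)          # load the generated solve() function
    return namespace["solve"](query)   # run the MAS on the query

if __name__ == "__main__":
    q = "What is 2 + 2?"
    print(run_mas(generate_mas_code(q), q))
```

Representing the MAS as plain code, as the paper proposes, is what makes this single-inference pattern possible: the output of one model call is directly executable, with no further orchestration calls needed.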

Rui Ye, Shuo Tang, Rui Ge, Yaxin Du, Zhenfei Yin, Siheng Chen, Jing Shao • 2025

Related benchmarks

Task                     | Dataset    | Result                | Rank
Mathematical Reasoning   | MATH       | Accuracy: 68.7        | 643
Mathematical Reasoning   | GSM8K      | Accuracy: 93.4        | 358
Question Answering       | GPQA       | Accuracy: 37.6        | 258
Code Generation          | HumanEval+ | --                    | 189
Out-of-Domain Reasoning  | GPQA       | Avg@8 Accuracy: 63.51 | 9
Mathematical Reasoning   | AIME 24    | Avg@8 Accuracy: 58.75 | 9
Multi-Agent Reasoning    | AIME 24    | Calls: 228            | 9
Multi-Agent Reasoning    | GPQA       | Calls: 1.52e+3        | 9
Mathematical Reasoning   | AIME 25    | Avg@8 Accuracy: 43.33 | 9
