Graph-of-Agents: A Graph-based Framework for Multi-Agent LLM Collaboration

About

With an ever-growing zoo of LLMs and benchmarks, the need to orchestrate multiple models for improved task performance has never been more pressing. While frameworks like Mixture-of-Agents (MoA) attempt to coordinate LLMs, they often fall short in terms of (1) selecting relevant agents, (2) facilitating effective inter-agent communication, and (3) integrating responses efficiently. In this work, we propose Graph-of-Agents (GoA), a new graph-based framework for modeling multi-agent LLM communication. Our approach begins with node sampling, selecting only the most relevant agents by leveraging model cards that summarize each model's domain, task specialization, and other characteristics. Next, we construct edges between the selected agents by evaluating their responses against one another to determine relevance ordering. Directed message passing is then performed from highly relevant agents to less relevant ones to enhance their responses, followed by reverse message passing to refine the original responses of the more relevant agents. Finally, the updated responses are aggregated via graph-based pooling (e.g., max or mean pooling) to produce a single, unified answer. We evaluate GoA on diverse multi-domain benchmarks (MMLU, MMLU-Pro, GPQA) and domain-specific benchmarks (MATH, HumanEval, MedMCQA), with an agent pool of 6 LLMs spanning multiple domains. Surprisingly, GoA achieves superior performance using only 3 selected agents, outperforming recent multi-agent LLM baselines that utilize all 6 agents simultaneously. By adopting a graph structure, GoA offers both scalability and effectiveness through structured message passing, positioning it as a strong candidate for navigating the challenges of the ever-growing LLM zoo. Code is available at: https://github.com/UNITES-Lab/GoA.
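The pipeline described above (node sampling, edge construction, forward and reverse message passing, pooling) can be sketched with stub agents in place of real LLMs. This is a toy illustration under stated assumptions, not the authors' implementation: keyword overlap stands in for model-card-based selection, `score` is a hypothetical judge function, majority vote stands in for mean pooling, and every name here (`Agent`, `sample_nodes`, `build_edges`, `graph_of_agents`, `echo_peers`) is invented for the example.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Agent:
    name: str
    card: str  # model-card summary: domain, task specialization, ...
    respond: Callable[[str, List[str]], str]  # (query, peer messages) -> answer


def sample_nodes(query: str, pool: List[Agent], k: int) -> List[Agent]:
    # Assumption: relevance = number of words shared by query and model card.
    words = set(query.lower().split())
    return sorted(pool, key=lambda a: len(words & set(a.card.lower().split())),
                  reverse=True)[:k]


def build_edges(ranked: List[str]) -> List[Tuple[str, str]]:
    # One directed edge from each more relevant agent to each less relevant one.
    return [(ranked[i], ranked[j])
            for i in range(len(ranked)) for j in range(i + 1, len(ranked))]


def graph_of_agents(query: str, pool: List[Agent], k: int,
                    score: Callable[[str, str], float], pooling: str = "max") -> str:
    agents = {a.name: a for a in sample_nodes(query, pool, k)}
    # Initial responses, produced without any peer messages.
    h = {name: a.respond(query, []) for name, a in agents.items()}
    # Relevance ordering: rank agents by the (hypothetical) judge score.
    ranked = sorted(agents, key=lambda n: score(query, h[n]), reverse=True)
    edges = build_edges(ranked)

    # Forward pass: less relevant agents revise using more relevant responses.
    before = dict(h)
    for dst in ranked[1:]:
        msgs = [before[src] for src, d in edges if d == dst]
        h[dst] = agents[dst].respond(query, msgs)

    # Reverse pass: more relevant agents revise using the updated responses.
    updated = dict(h)
    for src in ranked[:-1]:
        msgs = [updated[dst] for s, dst in edges if s == src]
        h[src] = agents[src].respond(query, msgs)

    if pooling == "max":   # keep the single best-scored response
        return max(h.values(), key=lambda r: score(query, r))
    if pooling == "mean":  # assumption: majority vote as a stand-in for averaging
        return Counter(h.values()).most_common(1)[0][0]
    raise ValueError(f"unknown pooling: {pooling}")


def echo_peers(default: str) -> Callable[[str, List[str]], str]:
    # Stub agent: adopts the first peer message if any, else answers `default`.
    return lambda query, messages: messages[0] if messages else default


pool = [
    Agent("math", "math arithmetic algebra", lambda query, messages: "4"),
    Agent("general", "general knowledge math", echo_peers("5")),
    Agent("code", "python code programming", echo_peers("22")),
    Agent("med", "medical clinical biology", echo_peers("unknown")),
]
query = "solve this math arithmetic problem: 2 + 2"
judge = lambda q, r: 1.0 if r == "4" else 0.0  # hypothetical judge
answer = graph_of_agents(query, pool, k=3, score=judge)
```

In a real system each `respond` call would prompt an LLM with the query plus the peer messages, and the judge would itself be derived from the agents' evaluations of one another, as the abstract describes.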

Sukwon Yun, Jie Peng, Pingzhi Li, Wendong Fan, Jie Chen, James Zou, Guohao Li, Tianlong Chen • 2026

Related benchmarks

Task	Dataset	Result	(unlabeled)
Code Generation	HumanEval	--	1048
Code Generation	HumanEval (test)	--	701
Mathematical Reasoning	GSM8K	Accuracy (Acc): 95.43	352
Reasoning	MMLU-Pro	Accuracy: 54.78	264
Graduate-level Question Answering	GPQA	Accuracy: 34.85	224
Knowledge Reasoning	MMLU-Pro	Accuracy: 90.24	148
Code Generation	HumanEval	HumanEval Score: 94.25	140
Arithmetic Reasoning	MultiArith (test)	Accuracy: 98.23	136
Question Answering	MedMCQA	Accuracy: 60.04	125
Multi-task Language Understanding	MMLU (test)	--	107

Showing 10 of 29 rows
