Graph-of-Agents: A Graph-based Framework for Multi-Agent LLM Collaboration

About

With an ever-growing zoo of LLMs and benchmarks, the need to orchestrate multiple models for improved task performance has never been more pressing. While frameworks like Mixture-of-Agents (MoA) attempt to coordinate LLMs, they often fall short in terms of (1) selecting relevant agents, (2) facilitating effective inter-agent communication, and (3) integrating responses efficiently. In this work, we propose Graph-of-Agents (GoA), a new graph-based framework for modeling multi-agent LLM communication. Our approach begins with node sampling, selecting only the most relevant agents by leveraging model cards that summarize each model's domain, task specialization, and other characteristics. Next, we construct edges between the selected agents by evaluating their responses against one another to determine relevance ordering. Directed message passing is then performed from highly relevant agents to less relevant ones to enhance their responses, followed by reverse message passing to refine the original responses of the more relevant agents. Finally, the updated responses are aggregated via graph-based pooling (e.g., max or mean pooling) to produce a single, unified answer. We evaluate GoA on diverse multi-domain benchmarks (MMLU, MMLU-Pro, GPQA) and domain-specific benchmarks (MATH, HumanEval, MedMCQA), with an agent pool of 6 LLMs spanning multiple domains. Surprisingly, GoA achieves superior performance using only 3 selected agents, outperforming recent multi-agent LLM baselines that utilize all 6 agents simultaneously. By adopting a graph structure, GoA offers both scalability and effectiveness through structured message passing, positioning it as a strong candidate for navigating the challenges of the ever-growing LLM zoo. Code is available at: https://github.com/UNITES-Lab/GoA.
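
The four stages described above (node sampling, edge construction, forward and reverse message passing, pooling) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation, which lives in the linked repository. The `Agent` class, the `keyword_overlap` score, and the stub `respond` functions are hypothetical stand-ins for real LLM calls, and max pooling is approximated here by returning the top-ranked agent's final response.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Agent:  # hypothetical container, not part of the GoA codebase
    name: str
    card: str  # model-card summary: domain, task specialization, etc.
    # Stand-in for an LLM call: (query, context responses) -> response text.
    respond: Callable[[str, List[str]], str]


def keyword_overlap(a: str, b: str) -> int:
    """Toy relevance score (assumption): number of shared lowercase words."""
    return len(set(a.lower().split()) & set(b.lower().split()))


def sample_nodes(query: str, pool: List[Agent], k: int) -> List[Agent]:
    """Node sampling: keep the k agents whose model cards best match the query."""
    return sorted(pool, key=lambda a: keyword_overlap(query, a.card), reverse=True)[:k]


def order_by_relevance(query: str, responses: Dict[str, str]) -> List[str]:
    """Edge construction: rank agent names by response relevance, highest first."""
    return sorted(responses, key=lambda n: keyword_overlap(query, responses[n]), reverse=True)


def graph_of_agents(query: str, pool: List[Agent], k: int = 3) -> str:
    nodes = {a.name: a for a in sample_nodes(query, pool, k)}
    responses = {n: a.respond(query, []) for n, a in nodes.items()}
    order = order_by_relevance(query, responses)

    # Forward pass: each lower-ranked agent revises its response using the
    # current responses of all agents ranked above it.
    for i, name in enumerate(order[1:], start=1):
        context = [responses[m] for m in order[:i]]
        responses[name] = nodes[name].respond(query, context)

    # Reverse pass: each higher-ranked agent refines its response using the
    # updated responses of all agents ranked below it.
    for i in range(len(order) - 2, -1, -1):
        context = [responses[m] for m in order[i + 1:]]
        responses[order[i]] = nodes[order[i]].respond(query, context)

    # Max pooling, simplified: return the top-ranked agent's final response.
    return responses[order[0]]


def make_stub(name: str, card: str, text: str) -> Agent:
    """Build a deterministic fake agent that reports how much context it saw."""
    return Agent(name, card, lambda query, context: f"{text} [saw {len(context)}]")


pool = [
    make_stub("math", "math algebra problem solving", "math view of the algebra problem"),
    make_stub("code", "python code generation", "code view"),
    make_stub("med", "medicine clinical questions", "medical view"),
    make_stub("general", "general knowledge algebra", "general view of algebra"),
]
query = "solve this algebra problem"
answer = graph_of_agents(query, pool, k=2)
print(answer)
```

Selecting `k` agents before any responses are generated is what keeps the cost below that of querying the whole pool. A mean-pooling variant would instead pass all final responses to an aggregator, which this sketch leaves out.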

Sukwon Yun, Jie Peng, Pingzhi Li, Wendong Fan, Jie Chen, James Zou, Guohao Li, Tianlong Chen • 2026

Related benchmarks

Task                               Dataset            Result                 Rank
Code Generation                    HumanEval          --                     1043
Code Generation                    HumanEval (test)   --                     612
Mathematical Reasoning             GSM8K              Accuracy (Acc) 95.43   337
Reasoning                          MMLU-Pro           Accuracy 54.78         241
Graduate-level Question Answering  GPQA               Accuracy 34.85         215
Code Generation                    HumanEval          HumanEval Score 94.25  128
Knowledge Reasoning                MMLU-Pro           Accuracy 90.24         120
Arithmetic Reasoning               MultiArith (test)  Accuracy 98.23         115
Question Answering                 MedMCQA            Accuracy 60.04         98
Reasoning                          GPQA               Accuracy 56.57         88
Showing 10 of 29 rows
