Agent Q-Mix: Selecting the Right Action for LLM Multi-Agent Systems through Reinforcement Learning
About
Large Language Models (LLMs) have shown remarkable performance across a wide range of tasks. However, solving complex problems often requires coordinating multiple agents, which raises a fundamental question: how to effectively select and interconnect these agents. In this paper, we propose **Agent Q-Mix**, a reinforcement learning framework that reformulates topology selection as a cooperative Multi-Agent Reinforcement Learning (MARL) problem. Our method learns decentralized communication decisions via QMIX value factorization, where each agent selects from a set of communication actions that jointly induce a round-wise communication graph. At its core, Agent Q-Mix combines a topology-aware GNN encoder, GRU memory, and per-agent Q-heads under the Centralized Training with Decentralized Execution (CTDE) paradigm. The framework optimizes a reward function that balances task accuracy against token cost. Across seven core benchmarks spanning coding, reasoning, and mathematics, Agent Q-Mix achieves the highest average accuracy among existing methods while demonstrating superior token efficiency and robustness to agent failure. Notably, on the challenging Humanity's Last Exam (HLE) with Gemini-3.1-Flash-Lite as the backbone, Agent Q-Mix reaches 20.8% accuracy, outperforming Microsoft Agent Framework (19.2%) and LangGraph (19.2%), followed by AutoGen and Lobster by OpenClaw. These results underscore the effectiveness of learned, decentralized topology optimization in pushing the boundaries of multi-agent reasoning.
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Code Generation | HumanEval | Accuracy | 97.56 | 99 |
| Reasoning | MMLU-Pro | Accuracy | 92.86 | 95 |
| Mathematics | AIME25 | Accuracy | 63.33 | 63 |
| Code Generation | LiveCodeBench v6 | Accuracy | 100 | 58 |
| Mathematics | HMMT | Accuracy | 53.33 | 26 |
| Mathematics | Beyond | Accuracy | 42 | 26 |
| Mathematics | AIME 26 | Accuracy | 60 | 26 |
| Multi-task Language Understanding | MMLU-Pro | Performance | 92.86 | 10 |
| Code Generation | LiveCodeBench | Total Tokens | 312 | 10 |
| Mathematical Reasoning | Beyond AIME | Total Tokens | 708 | 10 |