MAGE: Multi-Agent Self-Evolution with Co-Evolutionary Knowledge Graphs

About

Self-evolving language-model agents must decide what to learn next and how to preserve what they have learned across iterations. Existing systems typically carry this cross-iteration knowledge as natural-language feedback, flat episodic memory, or implicit reinforcement signals, none of which cleanly supports a frozen weak backbone at inference time. This paper introduces MAGE (Multi-Agent Graph-guided Evolution), a framework that externalizes self-knowledge into a four-subgraph co-evolutionary knowledge graph. Its experience subgraph stores both teacher-written failure corrections and the learner's own past correct reasoning traces, which are retrieved as task-conditioned guidance for a frozen execution model. During evolution, the graph, a task-level search bandit, and a skill-level routing bandit are updated from the same reward stream, while the learner's backbone remains unchanged. We further provide structural analysis showing how append-only memory growth, bounded curriculum coverage, and task-filtered retrieval together support stable improvement of the retrieval substrate for frozen-learner evolution. Across nine benchmarks spanning mathematical reasoning, multi-hop and open-domain question answering, spatio-temporal analysis, financial numerical reasoning, medical multiple-choice, an open-world survival game, and web navigation, MAGE achieves strong performance against prompt-based frozen-backbone baselines. Ablations show that self-harvested success traces and teacher-written corrections are complementary, with success memories contributing most on reasoning-template-heavy tasks and corrective memories supporting harder composition and interaction settings.
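
The evolution loop described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the bandit algorithm, the graph schema, or the retrieval scoring, so UCB1 is swapped in for both bandits, the experience subgraph is reduced to an append-only list filtered by task, and every name here (`Experience`, `ExperienceMemory`, `UCB1Bandit`, `evolve_step`, and the `run_frozen_learner` and `write_correction` callbacks) is hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Experience:
    task: str
    kind: str  # "success" (learner's own trace) or "correction" (teacher-written)
    text: str


class ExperienceMemory:
    """Append-only store with task-filtered retrieval (assumed stand-in
    for the experience subgraph)."""

    def __init__(self):
        self._items = []

    def append(self, exp):
        self._items.append(exp)  # entries are never edited or deleted

    def retrieve(self, task, k=2):
        # Task filter first, then the k most recent matching entries.
        matches = [e for e in self._items if e.task == task]
        return matches[-k:] if k > 0 else []

    def __len__(self):
        return len(self._items)


class UCB1Bandit:
    """Standard UCB1; an assumption, the source does not name the algorithm."""

    def __init__(self, arms):
        self.counts = {a: 0 for a in arms}
        self.totals = {a: 0.0 for a in arms}

    def select(self):
        for arm, n in self.counts.items():
            if n == 0:
                return arm  # play every arm once before using the bound
        t = sum(self.counts.values())
        return max(
            self.counts,
            key=lambda a: self.totals[a] / self.counts[a]
            + math.sqrt(2 * math.log(t) / self.counts[a]),
        )

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.totals[arm] += reward


def evolve_step(memory, task_bandit, skill_bandit,
                run_frozen_learner, write_correction):
    """One iteration: the learner's weights are never touched; only the
    memory and the two bandits change, all from the same reward."""
    task = task_bandit.select()
    skill = skill_bandit.select()
    guidance = memory.retrieve(task)
    trace, reward = run_frozen_learner(task, skill, guidance)
    task_bandit.update(task, reward)
    skill_bandit.update(skill, reward)
    if reward > 0:
        memory.append(Experience(task, "success", trace))
    else:
        memory.append(Experience(task, "correction", write_correction(task, trace)))
    return task, skill, reward
```

In this sketch the frozen learner and the teacher are passed in as plain callables, so the loop can be exercised with stubs before any model is attached. How the actual system scores tasks for its curriculum may differ from reward-maximizing UCB1.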

Ruiyi Yang, Zechen Li, Hao Xue, Imran Razzak, Flora D. Salim • 2026

Related benchmarks

Task	Dataset	Result
Math Reasoning	GSM8K	Accuracy 92.5	254
Financial Reasoning	FinQA	--	69
Web navigation	WebShop (test)	Score 90.2	36
Sequential environment decision making	Crafter (BALROG protocol)	Peak Task Score (%) 37.9	20
Mathematical Reasoning	GSM8K (200 held-out questions)	Accuracy 90.4	7
Mathematical Reasoning	RealMath (200 held-out questions)	Accuracy 85.1	7
Multi-hop Question Answering	HotpotQA (200 held-out questions)	Accuracy 91.5	7
Situation Reasoning	STBench (200 held-out questions)	Accuracy 57.8	7
Web-based Question Answering	WebQA (200 held-out questions)	Accuracy 62.7	7
Web-based Sequential Decision Making	WebShop (hundred-product catalog setup)	Mean Reward 90.2	6

