MAGE: Multi-Agent Self-Evolution with Co-Evolutionary Knowledge Graphs
About
Self-evolving language-model agents must decide what to learn next and how to preserve what they have learned across iterations. Existing systems typically carry this cross-iteration knowledge as natural-language feedback, flat episodic memory, or implicit reinforcement signals, none of which cleanly supports a frozen weak backbone at inference time. This paper introduces MAGE (Multi-Agent Graph-guided Evolution), a framework that externalizes self-knowledge into a four-subgraph co-evolutionary knowledge graph. Its experience subgraph stores both teacher-written failure corrections and the learner's own past correct reasoning traces, which are retrieved as task-conditioned guidance for a frozen execution model. During evolution, the graph, a task-level search bandit, and a skill-level routing bandit are updated from the same reward stream, while the learner's backbone remains unchanged. We further provide structural analysis showing how append-only memory growth, bounded curriculum coverage, and task-filtered retrieval together support stable improvement of the retrieval substrate for frozen-learner evolution. Across nine benchmarks spanning mathematical reasoning, multi-hop and open-domain question answering, spatio-temporal analysis, financial numerical reasoning, medical multiple-choice, an open-world survival game, and web navigation, MAGE achieves strong performance against prompt-based frozen-backbone baselines. Ablations show that self-harvested success traces and teacher-written corrections are complementary, with success memories contributing most on reasoning-template-heavy tasks and corrective memories supporting harder composition and interaction settings.
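The loop described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the authors' implementation: the names `Experience`, `ExperienceMemory`, `UCB1Bandit`, and `evolution_step` are hypothetical, UCB1 stands in for the unspecified bandit algorithm, a flat list stands in for the experience subgraph, and the learner and teacher are caller-supplied stubs. It shows only three ideas from the text: append-only memory, task-filtered retrieval, and two bandits updated from the same reward.

```python
import math
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Experience:
    task_type: str  # hypothetical task label used for filtering
    kind: str       # "success" (learner's own trace) or "correction" (teacher-written)
    text: str


@dataclass
class ExperienceMemory:
    """Append-only store: entries are added, never edited or deleted."""
    _entries: list = field(default_factory=list)

    def append(self, exp: Experience) -> None:
        self._entries.append(exp)

    def retrieve(self, task_type: str, k: int = 2) -> list:
        # Task-filtered retrieval: keep matching entries, most recent first.
        matches = [e for e in self._entries if e.task_type == task_type]
        return matches[::-1][:k]

    def __len__(self) -> int:
        return len(self._entries)


class UCB1Bandit:
    """Standard UCB1; an assumed stand-in for the paper's bandits."""

    def __init__(self, arms):
        self.arms = list(arms)
        self.counts = defaultdict(int)
        self.totals = defaultdict(float)

    def select(self):
        # Play every arm once before applying the UCB rule.
        for arm in self.arms:
            if self.counts[arm] == 0:
                return arm
        n = sum(self.counts.values())

        def ucb(arm):
            mean = self.totals[arm] / self.counts[arm]
            return mean + math.sqrt(2 * math.log(n) / self.counts[arm])

        return max(self.arms, key=ucb)

    def update(self, arm, reward: float) -> None:
        self.counts[arm] += 1
        self.totals[arm] += reward


def evolution_step(memory, task_bandit, skill_bandit, run_frozen_learner, write_correction):
    """One iteration; the learner's weights are never touched."""
    task_type = task_bandit.select()
    skill = skill_bandit.select()
    guidance = memory.retrieve(task_type)
    trace, reward = run_frozen_learner(task_type, skill, guidance)

    # Both bandits consume the same reward signal.
    task_bandit.update(task_type, reward)
    skill_bandit.update(skill, reward)

    # Successes are self-harvested; failures get a teacher-written correction.
    if reward > 0:
        memory.append(Experience(task_type, "success", trace))
    else:
        memory.append(Experience(task_type, "correction", write_correction(trace)))
    return reward
```

In this sketch `run_frozen_learner(task_type, skill, guidance)` is expected to return a `(trace, reward)` pair and `write_correction(trace)` a string. In a real system both would wrap model calls, and retrieval would presumably be richer than a label match.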
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Math Reasoning | GSM8K | Accuracy | 92.5 | 254 |
| Financial Reasoning | FinQA | -- | -- | 69 |
| Web navigation | WebShop (test) | Score | 90.2 | 16 |
| Sequential environment decision making | Crafter (BALROG protocol) | Peak Task Score (%) | 37.9 | 8 |
| Mathematical Reasoning | GSM8K (200 held-out questions) | Accuracy | 90.4 | 7 |
| Mathematical Reasoning | RealMath (200 held-out questions) | Accuracy | 85.1 | 7 |
| Multi-hop Question Answering | HotpotQA (200 held-out questions) | Accuracy | 91.5 | 7 |
| Situation Reasoning | STBench (200 held-out questions) | Accuracy | 57.8 | 7 |
| Web-based Question Answering | WebQA (200 held-out questions) | Accuracy | 62.7 | 7 |
| Web-based Sequential Decision Making | WebShop (hundred-product catalog setup) | Mean Reward | 90.2 | 6 |