MetaForge: A Self-Evolving Multimodal Agent that Retrieves, Adapts, and Forges Tools On Demand
About
Multimodal agents have achieved notable progress on complex reasoning tasks through tool use, yet remain limited by two issues: statically predefined tool inventories fail to generalize to unseen scenarios, and indiscriminate tool invocation incurs redundant cost and noise-induced errors. We propose MetaForge, a multimodal agent framework that learns when to invoke tools and how to evolve its toolset on demand. MetaForge factorizes agentic behavior into four coupled stages: Decide (judging whether tool use is warranted), Retrieve (selecting suitable tools), Adapt (grounding tool parameters in task context), and Forge (synthesizing new skills online and recycling them into the tool library for reuse), forming a closed judge-retrieve-adapt-forge-recycle loop. A unified orchestration policy enables the agent to choose among answering directly, reusing existing tools, or forging new ones. We jointly optimize invocation necessity, retrieval accuracy, execution effectiveness, and forged-skill reusability via reinforcement learning, with an explicit invocation-cost penalty discouraging redundant calls. Across 12 benchmarks, MetaForge consistently surpasses 16 baselines in accuracy, efficiency, and generalization, validating a paradigm shift from static tool inventories to on-demand self-evolution.
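The control flow described above can be illustrated with a minimal sketch. This is not the authors' implementation: every name here (`ToolLibrary`, `needs_tool`, `forge`, `run`, `reward`, the `tool:` task prefix, and the `cost_weight` value) is a hypothetical stand-in, and the learned orchestration policy is replaced by toy string heuristics. The reward shows only a correctness term and the invocation-cost penalty; the abstract's other objectives (retrieval accuracy, forged-skill reusability) are omitted.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Tuple

Tool = Callable[[str], str]


@dataclass
class ToolLibrary:
    """Hypothetical tool store that forged skills are recycled into."""
    tools: Dict[str, Tool] = field(default_factory=dict)

    def retrieve(self, task: str) -> Optional[str]:
        # Retrieve (toy version): first tool whose name appears in the task text.
        for name in self.tools:
            if name in task:
                return name
        return None

    def recycle(self, name: str, tool: Tool) -> None:
        # Recycle: store a forged tool so later tasks can reuse it.
        self.tools[name] = tool


def needs_tool(task: str) -> bool:
    # Decide (assumption): a learned policy would make this call;
    # here a "tool:" prefix stands in for that judgment.
    return task.startswith("tool:")


def forge(task: str) -> Tuple[str, Tool]:
    # Forge (assumption): a real system would synthesize code;
    # here we build a trivial tool named after the second word of the task.
    name = task.split()[1]
    return name, lambda arg: f"{name}({arg})"


def run(task: str, library: ToolLibrary) -> Tuple[str, str]:
    """Return (action, output), where action is 'direct', 'reuse', or 'forge'."""
    if not needs_tool(task):
        return "direct", f"answer({task})"
    action = "reuse"
    name = library.retrieve(task)
    if name is None:
        name, tool = forge(task)
        library.recycle(name, tool)
        action = "forge"
    arg = task.split()[-1]  # Adapt (toy version): take the parameter from the task text.
    return action, library.tools[name](arg)


def reward(correct: bool, num_calls: int, cost_weight: float = 0.1) -> float:
    # Correctness minus an explicit per-call penalty (cost_weight is an assumed value).
    return float(correct) - cost_weight * num_calls


library = ToolLibrary()
print(run("What is 2+2?", library))        # answered directly, no tool call
print(run("tool: count apples", library))  # no matching tool yet, so one is forged
print(run("tool: count pears", library))   # the forged tool is reused
```

In this sketch the second tool request reuses the skill forged by the first, which is the recycling behavior the loop is meant to capture; the penalty term in `reward` lowers the return of any trajectory that makes tool calls it did not need.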
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Science Question Answering | ScienceQA | -- | 791 |
| Mathematical Reasoning | MathVista | Score 79.78 | 474 |
| Diagram Question Answering | AI2D | AI2D Accuracy 89.6 | 387 |
| Counting | TallyQA | Accuracy 81.17 | 67 |
| Chart Question Answering | ChartQA | Accuracy 90.49 | 59 |
| OCR-based Visual Question Answering | OCRVQA | Mean Accuracy 86.51 | 50 |
| Document Visual Question Answering | DocVQA v1.0 (test) | -- | 49 |
| Tool Use | VerlTool IID Tools | Att.190 | 11 |
| Tool Use | VerlTool OOD Tools | Attribute12 | 11 |
| Visual Question Answering | MapQA | Accuracy 89.14 | 9 |