Evo-MedAgent: Beyond One-Shot Diagnosis with Agents That Remember, Reflect, and Improve

About

Tool-augmented large language model (LLM) agents can orchestrate specialist classifiers, segmentation models, and visual question-answering modules to interpret chest X-rays. However, these agents still solve each case in isolation: they fail to accumulate experience across cases, correct recurrent reasoning mistakes, or adapt their tool-use behavior without expensive reinforcement learning. While a radiologist naturally improves with every case, current agents remain static. In this work, we propose Evo-MedAgent, a self-evolving memory module that equips a medical agent with the capacity for inter-case learning at test time. Our memory comprises three complementary stores: (1) Retrospective Clinical Episodes that retrieve problem-solving experiences from similar past cases, (2) an Adaptive Procedural Heuristics bank curating priority-tagged diagnostic rules that evolves via reflection, much like a physician refining their internal criteria, and (3) a Tool Reliability Controller that tracks per-tool trustworthiness. On ChestAgentBench, Evo-MedAgent raises multiple-choice question (MCQ) accuracy from 0.68 to 0.79 on GPT-5-mini, and from 0.76 to 0.87 on Gemini-3 Flash. With a strong base model, evolving memory improves performance more effectively than orchestrating external tools on qualitative diagnostic tasks. Because Evo-MedAgent requires no training, its per-case overhead is bounded by one additional retrieval pass and a single reflection call, making it deployable on top of any frozen model.
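
To make the three-store design above more concrete, here is a minimal Python sketch. It is not the authors' implementation: the class name `EvolvingMemory`, the word-overlap (Jaccard) retrieval, the plus-or-minus-one priority update, and the Laplace-smoothed trust score are all assumptions chosen for illustration, since the abstract does not specify how each store is realized.

```python
from dataclasses import dataclass, field


@dataclass
class EvolvingMemory:
    """Toy three-store memory; every name and update rule here is hypothetical."""
    episodes: list = field(default_factory=list)    # (case tokens, lesson) pairs
    heuristics: dict = field(default_factory=dict)  # rule text -> priority score
    tool_stats: dict = field(default_factory=dict)  # tool name -> [correct, total]

    def add_episode(self, case_text, lesson):
        """Store a solved case as a bag of lowercase words plus its lesson."""
        self.episodes.append((set(case_text.lower().split()), lesson))

    def retrieve_episodes(self, case_text, k=2):
        """Return lessons from the k past cases with the highest Jaccard overlap."""
        query = set(case_text.lower().split())

        def similarity(tokens):
            union = query | tokens
            return len(query & tokens) / len(union) if union else 0.0

        ranked = sorted(self.episodes, key=lambda ep: similarity(ep[0]), reverse=True)
        return [lesson for _, lesson in ranked[:k]]

    def reflect(self, rule, helped):
        """Assumed reflection step: raise a rule's priority if it helped, else lower it."""
        self.heuristics[rule] = self.heuristics.get(rule, 0) + (1 if helped else -1)

    def top_heuristics(self, k=3):
        """Return the k highest-priority rules."""
        ranked = sorted(self.heuristics.items(), key=lambda kv: kv[1], reverse=True)
        return [rule for rule, _ in ranked[:k]]

    def record_tool(self, tool, was_correct):
        """Log whether a tool's output agreed with the final outcome."""
        stats = self.tool_stats.setdefault(tool, [0, 0])
        stats[0] += int(was_correct)
        stats[1] += 1

    def tool_trust(self, tool):
        """Laplace-smoothed success rate; a tool with no history scores 0.5."""
        correct, total = self.tool_stats.get(tool, (0, 0))
        return (correct + 1) / (total + 2)
```

In such a sketch, an agent would call `retrieve_episodes` and `top_heuristics` before answering a case (the retrieval pass), then call `add_episode`, `reflect`, and `record_tool` afterwards (the reflection step), so that nothing in the underlying model needs to be retrained.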

Weixiang Shen, Bailiang Jian, Jun Li, Che Liu, Johannes Moll, Xiaobin Hu, Daniel Rueckert, Hongwei Bran Li, Jiazhen Pan • 2026

Related benchmarks

Task	Dataset	Result
Clinical Task Execution	MedAgentBench OOD v2	Accuracy 86.7	35
Clinical Task Execution	MedAgentBench (val)	Accuracy 62.5	35
Clinical Task Execution	MedAgentBench (test)	Accuracy 67.8	35
Clinical Task Execution	MedAgentBench OOD	Accuracy 31.3	35
Clinical Task Execution	MedAgentBench v2 (test)	Accuracy 69.3	35
Clinical Task Execution	MedAgentBench v2 (val)	Accuracy 68.8	35

