SEMA-RAG: A Self-Evolving Multi-Agent Retrieval-Augmented Generation Framework for Medical Reasoning
About
Retrieval-Augmented Generation (RAG) is widely employed to mitigate risks such as hallucinations and knowledge obsolescence in medical question answering, yet its predominantly single-round, static retrieval paradigm misaligns with the multi-stage process of clinical reasoning. This compressed workflow induces two structural deficiencies: question-to-query translation often lacks clinically grounded semantic interpretation, and retrieval lacks iterative sufficiency feedback, making it difficult to form reliable evidence chains. We argue that both issues stem from a deeper cause: overloading a single reasoning chain with heterogeneous tasks of interpretation, exploration, and adjudication. The remedy is to reconstruct the workflow via task decoupling and dynamic multi-round exploration. To this end, we propose SEMA-RAG, a Self-Evolving Multi-Agent RAG framework for medical question answering, which assigns these roles to three specialist agents: the Interpreter Agent for clinical schema interpretation, the Explorer Agent for sufficiency-driven self-evolving retrieval, and the Arbiter Agent for evidence adjudication and answer selection. Across five benchmarks and five LLM backbones, SEMA-RAG improves over the strongest baseline by 6.46 accuracy points on average, measured per backbone.
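The three-role decomposition described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract names the agents but gives no interfaces, so `ClinicalSchema`, `interpret`, `explore`, `arbitrate`, `answer_question`, and the `retrieve`, `is_sufficient`, `refine`, and `score` callables are all hypothetical names chosen for illustration. In a real system each of them would presumably wrap an LLM call or a retriever.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ClinicalSchema:
    """Hypothetical container for the Interpreter Agent's output."""
    question: str
    options: List[str]
    initial_query: str


def interpret(question: str, options: List[str]) -> ClinicalSchema:
    # Placeholder: a real Interpreter Agent would produce a clinically
    # grounded reading of the question. Here the question is reused as-is.
    return ClinicalSchema(question, options, initial_query=question)


def explore(
    schema: ClinicalSchema,
    retrieve: Callable[[str], List[str]],
    is_sufficient: Callable[[ClinicalSchema, List[str]], bool],
    refine: Callable[[ClinicalSchema, List[str]], str],
    max_rounds: int = 3,
) -> List[str]:
    """Multi-round retrieval that stops once the evidence is judged sufficient."""
    evidence: List[str] = []
    query = schema.initial_query
    for _ in range(max_rounds):
        for passage in retrieve(query):
            if passage not in evidence:  # keep each passage once
                evidence.append(passage)
        if is_sufficient(schema, evidence):
            break
        # Sufficiency feedback drives the next query (assumed mechanism).
        query = refine(schema, evidence)
    return evidence


def arbitrate(
    schema: ClinicalSchema,
    evidence: List[str],
    score: Callable[[str, List[str]], float],
) -> str:
    """Select the option with the highest evidence-support score."""
    return max(schema.options, key=lambda option: score(option, evidence))


def answer_question(
    question: str,
    options: List[str],
    retrieve: Callable[[str], List[str]],
    is_sufficient: Callable[[ClinicalSchema, List[str]], bool],
    refine: Callable[[ClinicalSchema, List[str]], str],
    score: Callable[[str, List[str]], float],
    max_rounds: int = 3,
) -> str:
    schema = interpret(question, options)
    evidence = explore(schema, retrieve, is_sufficient, refine, max_rounds)
    return arbitrate(schema, evidence, score)
```

Passing the retriever and the judging functions in as callables keeps interpretation, exploration, and adjudication decoupled, which is the point the abstract makes about not overloading a single reasoning chain. The `max_rounds` cap is an assumption added so the loop always terminates.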
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Medical Question Answering | MedMCQA | Accuracy 76.07 | 521 |
| Medical Question Answering | PubMedQA | Accuracy 59.2 | 117 |
| Medical Question Answering | MMLU Med | Accuracy 92.1 | 86 |
| Medical Question Answering | BioASQ | Accuracy 88.67 | 63 |
| Medical Question Answering | MedQA US | Accuracy 90.42 | 43 |
| Health-related dialogue and decision-making | HealthBench Main | Average Score 33.64 | 24 |
| Multi-turn clinical response generation | MAQuE (test) | Accuracy 61.5 | 2 |