End-to-End Agentic RAG System Training for Traceable Diagnostic Reasoning
About
The integration of Large Language Models (LLMs) into healthcare is constrained by knowledge limitations, hallucinations, and a disconnect from Evidence-Based Medicine (EBM). While Retrieval-Augmented Generation (RAG) offers a solution, current systems often rely on static workflows that miss the iterative, hypothetico-deductive reasoning of clinicians. To address this, we introduce Deep-DxSearch, an agentic RAG system trained end-to-end via reinforcement learning (RL) for traceable diagnostic reasoning. Deep-DxSearch acts as an active investigator, treating the LLM as an agent within an environment of 16,000+ guideline-derived disease profiles, 150,000+ patient records for case-based reasoning, and over 27 million biomedical documents. Using soft verifiable rewards that co-optimize retrieval and reasoning, the model learns to formulate queries, evaluate evidence, and refine searches to close diagnostic gaps. Experiments show that our end-to-end RL framework consistently outperforms prompt-engineering and training-free RAG methods. On in-distribution (ID) and out-of-distribution (OOD) benchmarks for common and rare diseases, Deep-DxSearch surpasses strong baselines, including GPT-4o, DeepSeek-R1, and medical-specific frameworks, achieving an average accuracy gain of 22.7% over the second-best model. In validation with 150 real-world cases, Deep-DxSearch boosts physicians' average diagnostic accuracy from 45.6% to 69.1%. These results indicate that evolving agentic systems to leverage statistical regularities in large-scale healthcare data is key for trustworthy diagnostic assistants. All data, code, and checkpoints are available at https://qiaoyu-zheng.github.io/Deep-DxSearch.
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Rare Disease Diagnosis | RareBench MME | Recall@1 | 42.5 | 21 |
| Rare Disease Diagnosis | DDD | Recall@1 | 39.42 | 21 |
| Rare Disease Diagnosis | MyGene | Recall@1 | 30.14 | 21 |
| Rare Disease Diagnosis | RareBench HMS | Recall@1 | 36.36 | 21 |
| Rare Disease Diagnosis | RareBench LIRICAL | Recall@1 | 29.46 | 21 |
| Rare Disease Diagnosis | MIMIC-IV Rare | Recall@1 | 12.75 | 21 |
| Rare Disease Diagnosis | RareBench RAMEDIS | Recall@1 | 28.57 | 21 |
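
The results above are top-1 recall scores. As a minimal sketch of how such a score is typically computed, the Python below counts the share of cases whose gold diagnosis appears among the top-k ranked predictions. The function name `recall_at_k`, the exact string matching, and the toy disease lists are illustrative assumptions, not the benchmarks' actual evaluation code; real evaluations may normalize disease names or match on ontology identifiers instead.

```python
from typing import Sequence


def recall_at_k(
    ranked_predictions: Sequence[Sequence[str]],
    gold_labels: Sequence[str],
    k: int = 1,
) -> float:
    """Return the percentage of cases whose gold label is in the top-k predictions.

    Assumption: a prediction counts as correct only on an exact string match.
    """
    if len(ranked_predictions) != len(gold_labels):
        raise ValueError("predictions and gold labels must have the same length")
    if not gold_labels:
        raise ValueError("at least one case is required")
    if k < 1:
        raise ValueError("k must be a positive integer")

    hits = sum(
        gold in preds[:k]
        for preds, gold in zip(ranked_predictions, gold_labels)
    )
    return 100.0 * hits / len(gold_labels)


# Hypothetical toy data: one ranked differential per case, best guess first.
predictions = [
    ["Marfan syndrome", "Ehlers-Danlos syndrome"],
    ["Fabry disease", "Gaucher disease"],
    ["Pompe disease", "Fabry disease"],
    ["Wilson disease"],
]
gold = ["Marfan syndrome", "Gaucher disease", "Fabry disease", "Menkes disease"]

print(recall_at_k(predictions, gold, k=1))  # 25.0
print(recall_at_k(predictions, gold, k=2))  # 75.0
```

In the toy data only the first case has the gold diagnosis ranked first, so Recall@1 is 25.0, while widening to the top two predictions also recovers the second and third cases.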