End-to-End Agentic RAG System Training for Traceable Diagnostic Reasoning

About

The integration of Large Language Models (LLMs) into healthcare is constrained by knowledge limitations, hallucinations, and a disconnect from Evidence-Based Medicine (EBM). While Retrieval-Augmented Generation (RAG) offers a solution, current systems often rely on static workflows that miss the iterative, hypothetico-deductive reasoning of clinicians. To address this, we introduce Deep-DxSearch, an agentic RAG system trained end-to-end via reinforcement learning (RL) for traceable diagnostic reasoning. Deep-DxSearch acts as an active investigator, treating the LLM as an agent within an environment of 16,000+ guideline-derived disease profiles, 150,000+ patient records for case-based reasoning, and over 27 million biomedical documents. Using soft verifiable rewards that co-optimize retrieval and reasoning, the model learns to formulate queries, evaluate evidence, and refine searches to close diagnostic gaps. Experiments show our end-to-end RL framework consistently outperforms prompt-engineering and training-free RAG methods. On in-distribution (ID) and out-of-distribution (OOD) benchmarks for common and rare diseases, Deep-DxSearch surpasses strong baselines-including GPT-4o, DeepSeek-R1, and medical-specific frameworks-achieving an average accuracy gain of 22.7% over the second-best model. In validation with 150 real-world cases, Deep-DxSearch boosts physicians' average diagnostic accuracy from 45.6% to 69.1%. These results indicate that evolving agentic systems to leverage statistical regularities in large-scale healthcare data is key for trustworthy diagnostic assistants. All data, code, and checkpoints are available at https://qiaoyu-zheng.github.io/Deep-DxSearch.

Qiaoyu Zheng, Yuze Sun, Chaoyi Wu, Weike Zhao, Pengcheng Qiu, Yongguo Yu, Kun Sun, Jian Zhang, Yanfeng Wang, Ya Zhang, Weidi Xie• 2025

Related benchmarks

Task	Dataset	Result
Rare Disease Diagnosis	RareBench MME	R@142.5	21
Rare Disease Diagnosis	DDD	Recall@139.42	21
Rare Disease Diagnosis	MyGene	R@130.14	21
Rare Disease Diagnosis	RareBench HMS	Recall@136.36	21
Rare Disease Diagnosis	RareBench LIRICAL	R@129.46	21
Rare Disease Diagnosis	MIMIC-IV Rare	R@112.75	21
Rare Disease Diagnosis	RareBench RAMEDIS	Recall@128.57	21

Showing 7 of 7 rows

Other info

Follow for update

@wizwand_team Discord