Do Mixed-Vendor Multi-Agent LLMs Improve Clinical Diagnosis?

About

Multi-agent large language model (LLM) systems have emerged as a promising approach for clinical diagnosis, leveraging collaboration among agents to refine medical reasoning. However, most existing frameworks rely on single-vendor teams (e.g., multiple agents from the same model family), which risk correlated failure modes that reinforce shared biases rather than correcting them. We investigate the impact of vendor diversity by comparing Single-LLM, Single-Vendor, and Mixed-Vendor Multi-Agent Conversation (MAC) frameworks. Using three doctor agents instantiated with o4-mini, Gemini-2.5-Pro, and Claude-4.5-Sonnet, we evaluate performance on RareBench and DiagnosisArena. Mixed-vendor configurations consistently outperform single-vendor counterparts, achieving state-of-the-art recall and accuracy. Overlap analysis reveals the underlying mechanism: mixed-vendor teams pool complementary inductive biases, surfacing correct diagnoses that individual models or homogeneous teams collectively miss. These results highlight vendor diversity as a key design principle for robust clinical diagnostic systems.
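The overlap analysis mentioned above can be illustrated with a small set-based sketch. This is not the authors' code: the function name `overlap_analysis`, the model labels, and the case IDs are all hypothetical, and it assumes each model's output has already been scored as correct or incorrect per case.

```python
def overlap_analysis(correct_by_model):
    """Compare which cases each model diagnoses correctly.

    correct_by_model: dict mapping a model name to the set of case IDs
    that model got right (hypothetical input format).
    """
    all_sets = list(correct_by_model.values())
    # Cases solved by at least one model: the ceiling a pooled team could reach.
    union = set().union(*all_sets)
    # Cases solved by every model: no diversity benefit here.
    shared = set.intersection(*all_sets) if all_sets else set()
    # Cases solved by exactly one model: the complementary coverage.
    unique = {}
    for name, cases in correct_by_model.items():
        others = set().union(
            *(s for other, s in correct_by_model.items() if other != name)
        )
        unique[name] = cases - others
    return {"union": union, "shared": shared, "unique": unique}


# Toy data, invented purely for illustration.
correct = {
    "model_a": {1, 2, 3},
    "model_b": {2, 3, 4},
    "model_c": {3, 5},
}
result = overlap_analysis(correct)
print(len(result["union"]), len(result["shared"]), result["unique"])
```

In this toy example each model contributes one case the others miss, so the pooled coverage (five cases) exceeds any single model's (at most three). That is the kind of complementarity the abstract attributes to mixed-vendor teams.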

Grace Chang Yuan, Xiaoman Zhang, Sung Eun Kim, Pranav Rajpurkar • 2026

Related benchmarks

Task	Dataset	Result
Clinical Diagnosis	RareBench (combined)	Recall@1 39.31	7
Clinical Diagnosis	RareBench HMS subset n=88	Recall@1 51.14	7
Clinical diagnosis retrieval	RareBench MME n=40	R@1 40	7
Diagnosis	DiagnosisArena	Top-1 Accuracy 36.36	7
Clinical Diagnosis	RareBench LIRICAL	Recall@1 35.69	7
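For readers unfamiliar with the metric, the following is a minimal sketch of how Recall@k is commonly computed, assuming the table values are percentages and that a prediction counts as a hit on an exact string match. The benchmarks may use a different matching rule. The function name `recall_at_k` and the toy data are hypothetical.

```python
def recall_at_k(ranked_predictions, gold_labels, k=1):
    """Percentage of cases whose gold diagnosis appears in the top-k predictions.

    ranked_predictions: list of ranked diagnosis lists, one per case.
    gold_labels: list of gold diagnoses, aligned with ranked_predictions.
    Exact string matching is an assumption made for this sketch.
    """
    hits = sum(
        1
        for preds, gold in zip(ranked_predictions, gold_labels)
        if gold in preds[:k]
    )
    return 100.0 * hits / len(gold_labels)


# Toy data, invented purely for illustration.
predictions = [["A", "B"], ["C", "A"], ["B", "D"], ["D", "A"]]
gold = ["A", "A", "D", "D"]
print(recall_at_k(predictions, gold, k=1))  # 50.0
print(recall_at_k(predictions, gold, k=2))  # 100.0
```

With k=1 this reduces to top-1 accuracy when each case has a single gold diagnosis.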
