Measuring Competency, Not Performance: Item-Aware Evaluation Across Medical Benchmarks
About
Accuracy-based evaluation of Large Language Models (LLMs) measures benchmark-specific performance rather than underlying medical competency: it treats all questions as equally informative, conflates model ability with item characteristics, and thereby produces rankings that vary with benchmark choice. To address this, we introduce MedIRT, a psychometric evaluation framework grounded in Item Response Theory (IRT) that (1) jointly models latent competency and item-level difficulty and discrimination, and (2) includes benchmark integrity validation to ensure items within each topic measure a single, coherent underlying ability. We prospectively evaluate 71 diverse LLMs on a USMLE-aligned benchmark across 11 medical topics. As internal validation, MedIRT correctly predicts held-out LLM responses on unseen questions with 83.3% accuracy. As external validation, IRT-based rankings outperform accuracy-based rankings across 6 independent external medical benchmarks -- including expert preferences, holistic clinical tasks, safety judgments, and open-ended queries -- achieving 4 wins, 0 losses, and 18% lower variance. As a substantive finding, topic-level competency profiles expose striking domain-specific heterogeneity that aggregate accuracy masks. As a diagnostic tool, difficulty-tier analysis reveals two distinct response profiles (difficulty-sensitive responding and difficulty-insensitive responding) that require fundamentally different interventions. These results establish item-aware psychometric evaluation as a more valid and stable foundation for assessing LLMs in medicine, with potential implications for any high-stakes domain where benchmark integrity can be validated and items vary meaningfully in difficulty and discrimination.
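The abstract's description of jointly modeling latent competency with item-level difficulty and discrimination is consistent with a two-parameter logistic (2PL) IRT model, in which the probability that model *i* answers item *j* correctly is sigmoid(a_j(θ_i − b_j)), with θ_i the latent competency, b_j the item difficulty, and a_j the item discrimination. The snippet below is a minimal, self-contained sketch of fitting such a 2PL model by penalized joint maximum likelihood on a synthetic response matrix; it is illustrative only and is not the MedIRT implementation (the variable names, prior strength, and optimizer choice are assumptions made for the sketch).

```python
# Minimal 2PL IRT sketch (illustrative; not the MedIRT codebase).
# P(model i answers item j correctly) = sigmoid(a_j * (theta_i - b_j)),
# where theta is latent competency, b is item difficulty, a is item discrimination.
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit  # numerically stable sigmoid

rng = np.random.default_rng(0)
n_models, n_items = 20, 60

# Synthetic ground truth, used only to generate a binary response matrix for the demo.
theta_true = rng.normal(0.0, 1.0, n_models)
b_true = rng.normal(0.0, 1.0, n_items)
a_true = rng.lognormal(0.0, 0.3, n_items)
p_true = expit(a_true * (theta_true[:, None] - b_true[None, :]))
responses = (rng.random((n_models, n_items)) < p_true).astype(float)  # 1 = correct

def unpack(params):
    theta = params[:n_models]
    b = params[n_models:n_models + n_items]
    a = np.exp(params[n_models + n_items:])  # keep discrimination positive
    return theta, b, a

def neg_log_likelihood(params):
    theta, b, a = unpack(params)
    logits = a[None, :] * (theta[:, None] - b[None, :])
    p = expit(logits)
    eps = 1e-9
    ll = responses * np.log(p + eps) + (1.0 - responses) * np.log(1.0 - p + eps)
    # Weak Gaussian penalty keeps the joint estimate identified
    # (the scale/location of theta is otherwise arbitrary).
    penalty = 0.05 * (np.sum(theta ** 2) + np.sum(b ** 2))
    return -ll.sum() + penalty

x0 = np.zeros(n_models + 2 * n_items)
fit = minimize(neg_log_likelihood, x0, method="L-BFGS-B")
theta_hat, b_hat, a_hat = unpack(fit.x)

# Rank models by estimated latent competency rather than raw accuracy.
ranking = np.argsort(-theta_hat)
print("Estimated competency ranking (model indices):", ranking)
```

In this setup, held-out validation (as described for the internal check) would amount to predicting unseen entries of the response matrix from the fitted θ, b, and a and scoring the predicted correctness against the observed responses.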
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Medical Question Answering | MedQA (test) | Kendall's τ | 0.901 | 15 |
| Medical Question Answering | MedMCQA | Kendall's τ | 0.928 | 13 |
| Expert Preference Pairwise | LMArena Expert Med | Kendall's τ_b | 0.476 | 3 |
| General Preference Pairwise | LMArena Med | Kendall's τ_b | 0.594 | 3 |
| Medical Note Summarization | MEDIC Note Summ | Kendall's τ_b | -0.275 | 3 |
| Medical Safety Evaluation | MEDIC MedSafety | Kendall's τ_b | 0.552 | 3 |
| Open-Ended Medical Evaluation | MEDIC Open-Ended | Kendall's τ_b | 0.485 | 3 |
| Expert Medical Knowledge MCQ | MedXpertQA | Kendall's τ_b | 0.747 | 3 |
| Expert Preference Pairwise | MedArena | Kendall's τ_b | 1.000 | 3 |
| Holistic Medical Tasks | MedHELM All | Kendall's τ_b | 0.733 | 3 |
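All of the rank-agreement numbers above are Kendall correlations between two orderings of the same set of models (e.g., the IRT-based competency ranking versus an external benchmark's ranking). The snippet below is a minimal sketch of how a τ_b value of this kind can be computed with SciPy; the five model scores are hypothetical and serve only to illustrate the computation, not to reproduce any entry in the table.

```python
# Illustrative rank-agreement check (not from the paper's code):
# Kendall's tau_b between IRT-estimated competencies and an external benchmark's scores.
from scipy.stats import kendalltau

# Hypothetical scores for the same five models under two evaluations.
irt_competency = [1.2, 0.4, -0.3, 0.9, -1.1]      # latent competency estimates
external_scores = [0.83, 0.71, 0.55, 0.78, 0.49]  # external benchmark scores

tau_b, p_value = kendalltau(irt_competency, external_scores)  # default variant is tau_b (tie-corrected)
print(f"Kendall's tau_b = {tau_b:.3f} (p = {p_value:.3f})")
```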