Measuring Competency, Not Performance: Item-Aware Evaluation Across Medical Benchmarks

About

Accuracy-based evaluation of Large Language Models (LLMs) measures benchmark-specific performance rather than underlying medical competency: it treats all questions as equally informative, conflates model ability with item characteristics, and thereby produces rankings that vary with benchmark choice. To address this, we introduce MedIRT, a psychometric evaluation framework grounded in Item Response Theory (IRT) that (1) jointly models latent competency and item-level difficulty and discrimination, and (2) includes benchmark integrity validation to ensure items within each topic measure a single, coherent underlying ability. We prospectively evaluate 71 diverse LLMs on a USMLE-aligned benchmark across 11 medical topics. As internal validation, MedIRT predicts held-out LLM responses on unseen questions with 83.3% accuracy. As external validation, IRT-based rankings outperform accuracy-based rankings across 6 independent external medical benchmarks -- including expert preferences, holistic clinical tasks, safety judgments, and open-ended queries -- achieving 4 wins, 0 losses, and 18% lower variance. As a substantive finding, topic-level competency profiles expose striking domain-specific heterogeneity that aggregate accuracy masks. As a diagnostic tool, difficulty-tier analysis reveals two distinct response profiles (difficulty-sensitive responding and difficulty-insensitive responding) that require fundamentally different interventions. These results establish item-aware psychometric evaluation as a more valid and stable foundation for assessing LLMs in medicine, with potential implications for any high-stakes domain where benchmark integrity can be validated and items vary meaningfully in difficulty and discrimination.

Zhimeng Luo, Lixin Wu, Adam Frisch, Daqing He • 2025
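
The abstract's core mechanism — jointly modeling latent competency with item difficulty and discrimination — is characteristic of a two-parameter logistic (2PL) IRT model, where the probability that model j answers item i correctly is P(correct) = 1 / (1 + exp(-a_i(θ_j - b_i))), with ability θ_j, discrimination a_i, and difficulty b_i. The sketch below is a minimal illustration of that response function and a maximum-likelihood ability estimate for a fixed item bank; the function names, the scipy-based estimation, and the item parameters are assumptions for illustration, not MedIRT's actual implementation.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def p_correct(theta, a, b):
    """2PL item response function: P(correct | theta) for
    item discrimination a and difficulty b."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def estimate_ability(responses, a, b):
    """Maximum-likelihood estimate of latent ability theta,
    given 0/1 graded responses and known item parameters."""
    def neg_log_lik(theta):
        p = np.clip(p_correct(theta, a, b), 1e-9, 1 - 1e-9)  # numerical safety
        return -np.sum(responses * np.log(p) + (1 - responses) * np.log(1 - p))
    # Ability is conventionally placed on roughly a [-4, 4] logit scale.
    return minimize_scalar(neg_log_lik, bounds=(-4, 4), method="bounded").x

# Hypothetical item bank: 5 items varying in difficulty and discrimination.
a = np.array([1.2, 0.8, 1.5, 1.0, 2.0])   # discrimination
b = np.array([-1.0, 0.0, 0.5, 1.0, 2.0])  # difficulty
responses = np.array([1, 1, 1, 0, 0])     # one LLM's graded answers

theta_hat = estimate_ability(responses, a, b)
print(f"estimated competency theta = {theta_hat:.2f}")
```

Under this kind of model, the fitted `p_correct` also supports the paper's internal validation: once item parameters and abilities are estimated, the probability of a correct response on a held-out item can be thresholded to predict unseen answers.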

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Medical Question Answering | MedQA (test) | Tau (τ) | 0.901 | 15 |
| Medical Question Answering | MedMCQA | Tau (τ) | 0.928 | 13 |
| Expert Preference Pairwise | LMArena Expert Med | Kendall's τ_b | 0.476 | 3 |
| General Preference Pairwise | LMArena Med | Kendall's τ_b | 0.594 | 3 |
| Medical Note Summarization | MEDIC Note Summ | Kendall's τ_b | -0.275 | 3 |
| Medical Safety Evaluation | MEDIC MedSafety | Kendall's τ_b | 0.552 | 3 |
| Open-Ended Medical Evaluation | MEDIC Open-Ended | Kendall's τ_b | 0.485 | 3 |
| Expert Medical Knowledge MCQ | MedXpertQA | Kendall's τ_b | 0.747 | 3 |
| Expert Preference Pairwise | MedArena | Kendall's τ_b | 1.000 | 3 |
| Holistic Medical Tasks | MedHELM All | Kendall's τ_b | 0.733 | 3 |

Showing 10 of 13 rows.
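
Most results above report Kendall's τ_b, a rank correlation between two orderings of the same models (e.g., the IRT-based ranking versus an external benchmark's ranking) that adjusts for ties; values run from -1 (fully reversed) to 1 (identical order). A minimal sketch of how such a value can be computed with scipy follows; the two rankings are invented for illustration and are not the paper's data.

```python
from scipy.stats import kendalltau

# Hypothetical example: positions of five LLMs under two evaluations.
irt_rank      = [1, 2, 3, 4, 5]  # ordering induced by IRT ability estimates
external_rank = [1, 3, 2, 4, 5]  # ordering on an external benchmark

# scipy's kendalltau defaults to the tau-b variant, which handles ties.
tau_b, p_value = kendalltau(irt_rank, external_rank)
print(f"Kendall's tau_b = {tau_b:.3f} (p = {p_value:.3f})")
```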
