Measuring Competency, Not Performance: Item-Aware Evaluation Across Medical Benchmarks
About
Accuracy-based evaluation of Large Language Models (LLMs) measures benchmark-specific performance rather than underlying medical competency: it treats all questions as equally informative, conflates model ability with item characteristics, and thereby produces rankings that vary with benchmark choice. To address this, we introduce MedIRT, a psychometric evaluation framework grounded in Item Response Theory (IRT) that (1) jointly models latent competency and item-level difficulty and discrimination, and (2) includes benchmark integrity validation to ensure items within each topic measure a single, coherent underlying ability. We prospectively evaluate 71 diverse LLMs on a USMLE-aligned benchmark across 11 medical topics. As internal validation, MedIRT correctly predicts held-out LLM responses on unseen questions with 83.3% accuracy. As external validation, IRT-based rankings outperform accuracy-based rankings across 6 independent external medical benchmarks -- including expert preferences, holistic clinical tasks, safety judgments, and open-ended queries -- achieving 4 wins, 0 losses, and 18% lower variance. As a substantive finding, topic-level competency profiles expose striking domain-specific heterogeneity that aggregate accuracy masks. As a diagnostic tool, difficulty-tier analysis reveals two distinct response profiles (difficulty-sensitive responding and difficulty-insensitive responding) that require fundamentally different interventions. These results establish item-aware psychometric evaluation as a more valid and stable foundation for assessing LLMs in medicine, with potential implications for any high-stakes domain where benchmark integrity can be validated and items vary meaningfully in difficulty and discrimination.
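The abstract's description of jointly modeling latent competency with item-level difficulty and discrimination is consistent with a two-parameter logistic (2PL) IRT model, in which the probability that model *i* answers item *j* correctly is sigmoid(a_j(θ_i − b_j)), with θ_i the latent competency, b_j the item difficulty, and a_j the item discrimination. The snippet below is a minimal, self-contained sketch of fitting such a 2PL model by penalized joint maximum likelihood on a synthetic response matrix; it is illustrative only and is not the MedIRT implementation (the variable names, prior strength, and optimizer choice are assumptions made for the sketch).

```python
# Minimal 2PL IRT sketch (illustrative; not the MedIRT codebase).
# P(model i answers item j correctly) = sigmoid(a_j * (theta_i - b_j)),
# where theta is latent competency, b is item difficulty, a is item discrimination.
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit  # numerically stable sigmoid

rng = np.random.default_rng(0)
n_models, n_items = 20, 60

# Synthetic ground truth, used only to generate a binary response matrix for the demo.
theta_true = rng.normal(0.0, 1.0, n_models)
b_true = rng.normal(0.0, 1.0, n_items)
a_true = rng.lognormal(0.0, 0.3, n_items)
p_true = expit(a_true * (theta_true[:, None] - b_true[None, :]))
responses = (rng.random((n_models, n_items)) < p_true).astype(float)  # 1 = correct

def unpack(params):
    theta = params[:n_models]
    b = params[n_models:n_models + n_items]
    a = np.exp(params[n_models + n_items:])  # keep discrimination positive
    return theta, b, a

def neg_log_likelihood(params):
    theta, b, a = unpack(params)
    logits = a[None, :] * (theta[:, None] - b[None, :])
    p = expit(logits)
    eps = 1e-9
    ll = responses * np.log(p + eps) + (1.0 - responses) * np.log(1.0 - p + eps)
    # Weak Gaussian penalty keeps the joint estimate identified
    # (the scale/location of theta is otherwise arbitrary).
    penalty = 0.05 * (np.sum(theta ** 2) + np.sum(b ** 2))
    return -ll.sum() + penalty

x0 = np.zeros(n_models + 2 * n_items)
fit = minimize(neg_log_likelihood, x0, method="L-BFGS-B")
theta_hat, b_hat, a_hat = unpack(fit.x)

# Rank models by estimated latent competency rather than raw accuracy.
ranking = np.argsort(-theta_hat)
print("Estimated competency ranking (model indices):", ranking)
```

In this setup, held-out validation (as described for the internal check) would amount to predicting unseen entries of the response matrix from the fitted θ, b, and a and scoring the predicted correctness against the observed responses.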
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Medical Question Answering | MedQA (test) | Kendall's τ | 0.901 | 15 |
| Medical Question Answering | MedMCQA | Kendall's τ | 0.928 | 13 |
| Expert Preference Pairwise | LMArena Expert Med | Kendall's τ_b | 0.476 | 3 |
| General Preference Pairwise | LMArena Med | Kendall's τ_b | 0.594 | 3 |
| Medical Note Summarization | MEDIC Note Summ | Kendall's τ_b | -0.275 | 3 |
| Medical Safety Evaluation | MEDIC MedSafety | Kendall's τ_b | 0.552 | 3 |
| Open-Ended Medical Evaluation | MEDIC Open-Ended | Kendall's τ_b | 0.485 | 3 |
| Expert Medical Knowledge MCQ | MedXpertQA | Kendall's τ_b | 0.747 | 3 |
| Expert Preference Pairwise | MedArena | Kendall's τ_b | 1.000 | 3 |
| Holistic Medical Tasks | MedHELM All | Kendall's τ_b | 0.733 | 3 |
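All of the rank-agreement numbers above are Kendall correlations between two orderings of the same set of models (e.g., the IRT-based competency ranking versus an external benchmark's ranking). The snippet below is a minimal sketch of how a τ_b value of this kind can be computed with SciPy; the five model scores are hypothetical and serve only to illustrate the computation, not to reproduce any entry in the table.

```python
# Illustrative rank-agreement check (not from the paper's code):
# Kendall's tau_b between IRT-estimated competencies and an external benchmark's scores.
from scipy.stats import kendalltau

# Hypothetical scores for the same five models under two evaluations.
irt_competency = [1.2, 0.4, -0.3, 0.9, -1.1]      # latent competency estimates
external_scores = [0.83, 0.71, 0.55, 0.78, 0.49]  # external benchmark scores

tau_b, p_value = kendalltau(irt_competency, external_scores)  # default variant is tau_b (tie-corrected)
print(f"Kendall's tau_b = {tau_b:.3f} (p = {p_value:.3f})")
```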