| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Question Answering | PubMedQA | Accuracy | 83.6 | 145 |
| Question Answering | PubMedQA (test) | Accuracy | 82.4 | 128 |
| Medical Question Answering | PubMedQA | Accuracy | 81.4 | 92 |
| Question Answering | PubMedQA PQA-L (test) | Accuracy | 78.2 | 43 |
| Hallucination Detection | PubMedQA | F1 Score | 88 | 36 |
| Medical Question Answering | PubMedQA | Factual Accuracy (FA) | 95.63 | 28 |
| Language Modeling | PubMedQA MdQ | PPL Change (%) vs Baseline | 0 | 24 |
| Question Answering | PubMedQA | EM | 79.82 | 18 |
| Prompt Leakage Attack | PubMedQA | ASR (500) | 14 | 16 |
| Multiple-choice Question Answering | PubMedQA | Accuracy | 63.62 | 15 |
| Question Answering | PubMedQA | Context Influence | 115.78 | 15 |
| Question Answering | PubMedQA | Accuracy | 82.2 | 15 |
| Medical Question Answering | PubMedQA | Pass@1 | 86 | 14 |
| Question Answering | PubMedQA (out-of-domain) | ROUGE-L | 11.7 | 14 |
| Medical Reasoning | PubMedQA | Accuracy | 78.3 | 13 |
| Biomedical Question Answering | PubMedQA | Accuracy | 68.32 | 13 |
| Medical Reasoning | PubMedQA | Token Cost (tokens/question) | 1,509 | 11 |
| Biomedical Question Answering | PubMedQA PQA-L In-Domain (test) | Accuracy | 78 | 11 |
| Medical Question Answering | PubMedQA | Accuracy | 78.4 | 10 |
| Medical Question Answering | PubMedQA | Kendall's Tau | 4.03 | 10 |
| Close-ended QA | PubMedQA | Accuracy | 85 | 10 |
| Medical Question Answering | PubMedQA Reasoning Required | Accuracy | 82 | 10 |
| Domain Adaptation | PubMedQA | PPL Delta (%) | 8.3 | 9 |
| Language Modeling | PubMedQA | PPL Change (%) | 8.3 | 9 |
| Multiple choice QA | PubMedQA (test) | AUROC | 81.8 | 9 |