| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Question Answering | PubMedQA (test) | Accuracy 82.4 | 170 |
| Question Answering | PubMedQA | Accuracy 83.6 | 145 |
| Medical Question Answering | PubMedQA | Accuracy 81.4 | 117 |
| Medical Question Answering | PubMedQA | Accuracy 82.8 | 65 |
| Question Answering | PubMedQA PQA-L (test) | Accuracy 87.08 | 45 |
| Biomedical Question Answering | PubMedQA | Attack Accuracy 77 | 40 |
| Hallucination Detection | PubMedQA | F1 Score 88 | 36 |
| Multiple Choice Question Answering | PubMedQA (test) | Accuracy 76.03 | 34 |
| Medical Question Answering | PubMedQA | Pass@1 86 | 32 |
| Medical Question Answering | PubMedQA | Factual Accuracy (FA) 95.63 | 28 |
| Language Modeling | PubMedQA MdQ | PPL Change (%) vs Baseline 0 | 24 |
| Question Answering | PubMedQA | EM 79.82 | 18 |
| Question Answering | PubMedQA long-context (PQA-L) | Macro-F1 61.1 | 17 |
| Prompt Leakage Attack | PubMedQA | ASR (500) 14 | 16 |
| Question Answering | PubMedQA | Recall@1 89.8 | 15 |
| Multiple-choice Question Answering | PubMedQA | Accuracy 63.62 | 15 |
| Question Answering | PubMedQA | Context Influence 115.78 | 15 |
| Question Answering | PubMedQA | Accuracy 82.2 | 15 |
| Selective Generation | PubMedQA | PRR (ROUGE-L) 0.372 | 14 |
| Question Answering | PubMedQA (out-of-domain) | ROUGE-L 11.7 | 14 |
| Medical Reasoning | PubMedQA | Accuracy 78.3 | 13 |
| Biomedical Question Answering | PubMedQA | Accuracy 68.32 | 13 |
| Speculative Decoding Inference | PubMedQA | Throughput (tokens/s) 182.24 | 12 |
| Medical Reasoning | PubMedQA | Token Cost (tokens/question) 1,509 | 11 |
| Biomedical Question Answering | PubMedQA PQA-L In-Domain (test) | Accuracy 78 | 11 |