PubMedQA

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Question Answering | PubMedQA (test) | Accuracy 82.4 | 170 |
| Question Answering | PubMedQA | Accuracy 83.6 | 145 |
| Medical Question Answering | PubMedQA | Accuracy 81.4 | 117 |
| Medical Question Answering | PubMedQA | Accuracy 82.8 | 65 |
| Question Answering | PubMedQA PQA-L (test) | Accuracy 87.08 | 45 |
| Biomedical Question Answering | PubMedQA | Attack Accuracy 77 | 40 |
| Hallucination Detection | PubmedQA | F1 Score 88 | 36 |
| Multiple Choice Question Answering | PubMedQA (test) | Accuracy 76.03 | 34 |
| Medical Question Answering | PubMedQA | Pass@1 86 | 32 |
| Medical Question Answering | PubMedQA | Factual Accuracy (FA) 95.63 | 28 |
| Language Modeling | PubMedQA MdQ | PPL Change (%) vs Baseline 0 | 24 |
| Question Answering | PubMedQA | EM 79.82 | 18 |
| Question Answering | PubMedQA long-context (PQA-L) | Macro-F1 61.1 | 17 |
| Prompt Leakage Attack | PubMedQA | ASR (500) 14 | 16 |
| Question Answering | PubMedQA | Recall@1 89.8 | 15 |
| Multiple-choice Question Answering | PubMedQA | Accuracy 63.62 | 15 |
| Question Answering | PubMedQA | Context Influence 115.78 | 15 |
| Question Answering | PubMedQA | Accuracy 82.2 | 15 |
| Selective Generation | PubMedQA | PRR (ROUGE-L) 0.372 | 14 |
| Question Answering | PubMedQA (out-of-domain) | ROUGE-L 11.7 | 14 |
| Medical Reasoning | PubMedQA | Accuracy 78.3 | 13 |
| Biomedical Question Answering | PubMedQA | Accuracy 68.32 | 13 |
| Speculative Decoding Inference | PubMedQA | Throughput (tokens/s) 182.24 | 12 |
| Medical Reasoning | PubMedQA | Token Cost (tokens/question) 1,509 | 11 |
| Biomedical Question Answering | PubMedQA PQA-L In-Domain (test) | Accuracy 78 | 11 |

Showing 25 of 63 rows