Explanation Quality Evaluation on PubMedQA (test)
[Chart: Reasoning Soundness Loss; PubMed Reasoner, 39.7, Mar 28, 2026]
Updated 2mo ago
Evaluation Results
Method: PubMed Reasoner (2026.03)

Criterion              Loss    Tie Rate    Win Rate    Score
Reasoning Soundness    39.7    5.7         54.6        3.584
Evidence Grounding     40.8    3.8         55.4        3.601
Clinical Relevance     39.2    7.4         53.4        3.584
Trustworthiness        38.7    5.4         55.9        3.587

Method: Gemini (retrieval=w/o retrieval, 2026.03)

Criterion              Loss    Tie Rate    Win Rate    Score
Reasoning Soundness    54.6    5.7         39.7        3.416
Evidence Grounding     55.4    3.8         40.8        3.421
Clinical Relevance     53.4    7.4         39.2        3.438
Trustworthiness        55.9    5.4         38.7        3.424
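The loss, tie, and win figures reported above appear to mirror each other across the two methods. The sketch below is a small consistency check on the listed values. The names `RESULTS`, `rates_sum_to_100`, and `is_mirrored` are illustrative, and reading the numbers as percentages from a head-to-head comparison between these two methods is an assumption drawn from the values themselves, not something the page states.

```python
# Values copied from the results above, stored as (loss, tie, win) per criterion.
# Treating them as percentages of pairwise comparisons is an assumption.
RESULTS = {
    "PubMed Reasoner": {
        "Reasoning Soundness": (39.7, 5.7, 54.6),
        "Evidence Grounding": (40.8, 3.8, 55.4),
        "Clinical Relevance": (39.2, 7.4, 53.4),
        "Trustworthiness": (38.7, 5.4, 55.9),
    },
    "Gemini": {
        "Reasoning Soundness": (54.6, 5.7, 39.7),
        "Evidence Grounding": (55.4, 3.8, 40.8),
        "Clinical Relevance": (53.4, 7.4, 39.2),
        "Trustworthiness": (55.9, 5.4, 38.7),
    },
}


def rates_sum_to_100(rates, tol=0.15):
    """Return True if loss + tie + win is 100 within a rounding tolerance."""
    return abs(sum(rates) - 100.0) <= tol


def is_mirrored(a, b, tol=1e-9):
    """Return True if a's (loss, tie, win) equals b's (win, tie, loss)."""
    return all(abs(x - y) <= tol for x, y in zip(a, reversed(b)))


if __name__ == "__main__":
    for criterion, ours in RESULTS["PubMed Reasoner"].items():
        theirs = RESULTS["Gemini"][criterion]
        print(
            criterion,
            "sums to 100:", rates_sum_to_100(ours) and rates_sum_to_100(theirs),
            "mirrored:", is_mirrored(ours, theirs),
        )
```

If both checks hold for every criterion, the two rows are consistent with a single set of pairwise judgments viewed from each method's side.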