Explanation Quality Evaluation on MMLU-CK (test)
[Chart: Reasoning Soundness Loss (%); PubMed Reasoner: 44 (Mar 28, 2026)]
Updated 2mo ago
Evaluation Results
Method: PubMed Reasoner (2026.03)

Criterion             Loss (%)   Tie (%)   Win (%)   Avg Likert
Reasoning Soundness   44         10.8      45.2      3.699
Evidence Grounding    25.2       11.7      64.1      3.595
Clinical Relevance    34.4       20        45.6      3.732
Trustworthiness       35.3       15.9      48.8      3.712

Method: Gemini (retrieval=w/o retrieval, 2026.03)

Criterion             Loss (%)   Tie (%)   Win (%)   Avg Likert
Reasoning Soundness   45.2       10.8      44        3.307
Evidence Grounding    64.1       11.7      25.2      3.209
Clinical Relevance    45.6       20        34.4      3.525
Trustworthiness       48.8       15.9      35.3      3.386