
medical

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Language Modeling | Medical (Med) | PPL Change (%) vs Baseline: 0.1 | 30 |
| Partial Multi-label Learning | medical | Average Precision: 87.5 | 21 |
| Partial Multi-Label Learning | medical | Ranking Loss: 0.03 | 21 |
| Rubric satisfaction evaluation | Medical | Claude-4 Sonnet Score: 50.9 | 21 |
| Hypernym discovery | medical Gold standard domain-specific (test) | MRR: 77.32 | 18 |
| Question Answering | Medical | GPT Accuracy: 68.81 | 14 |
| Preference Evaluation | Medical | Avg Score: 8.58 | 14 |
| Retrieval-Augmented Generation | Medical | Indexing Time (minutes): 7 | 11 |
| Importance-based Node Leakage | Medical | Leakage (Deg): 36.2 | 10 |
| Factual Precision Evaluation | Medical | SAFE: 87.3 | 10 |
| Machine Translation | Medical (test) | BLEU: 55.42 | 9 |
| MRI to CT translation | medical MRI→CT 256 × 256 (test) | NFE: 4 | 7 |
| Misaligned Task Learning | Medical In-domain | Misalignment: 3.2 | 6 |
| Emergent Misalignment Measurement | Medical General Evaluation | Misalignment: 0.38 | 6 |
| Machine Translation | Medical All-domain datastore (test) | BLEU: 55.1 | 6 |
| Multi-label classification | Medical | Subset Accuracy: 27.3 | 5 |
| Multi-label classification | Medical | Hamming Loss: 2.48 | 5 |
| Multi-label classification | Medical | F-Measure: 40.94 | 5 |
| Multi-label Classification | Medical | Accuracy: 37.36 | 5 |
| Access Control | Medical | Accuracy: 100 | 5 |
| Machine Translation | Medical out-of-domain (test) | BLEU: 15.4 | 5 |
| Mixed Linear Regression | medical | Minimal Error (K=2): 0.1591 | 5 |
| Machine Translation | Medical multi-domain (test) | Decoding Throughput (Tok/Sec): 3,152.59 | 2 |
| DSL Evaluation | Medical | Opinion: 4.4 | 1 |