| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Language Modeling | Medical (Med) | PPL Change (%) vs Baseline: 0.1 | 30 |
| Partial Multi-label Learning | medical | Average Precision: 87.5 | 21 |
| Partial Multi-Label Learning | medical | Ranking Loss: 0.03 | 21 |
| Rubric satisfaction evaluation | Medical | Claude-4 Sonnet Score: 50.9 | 21 |
| Hypernym discovery | medical Gold standard domain-specific (test) | MRR: 77.32 | 18 |
| Question Answering | Medical | GPT Accuracy: 68.81 | 14 |
| Preference Evaluation | Medical | Avg Score: 8.58 | 14 |
| Retrieval-Augmented Generation | Medical | Indexing Time (minutes): 7 | 11 |
| Importance-based Node Leakage | Medical | Leakage (Deg): 36.2 | 10 |
| Factual Precision Evaluation | Medical | SAFE: 87.3 | 10 |
| Machine Translation | Medical (test) | BLEU: 55.42 | 9 |
| MRI to CT translation | medical MRI→CT 256 × 256 (test) | NFE: 4 | 7 |
| Misaligned Task Learning | Medical In-domain | Misalignment: 3.2 | 6 |
| Emergent Misalignment Measurement | Medical General Evaluation | Misalignment: 0.38 | 6 |
| Machine Translation | Medical All-domain datastore (test) | BLEU: 55.1 | 6 |
| Multi-label classification | Medical | Subset Accuracy: 27.3 | 5 |
| Multi-label classification | Medical | Hamming Loss: 2.48 | 5 |
| Multi-label classification | Medical | F-Measure: 40.94 | 5 |
| Multi-label Classification | Medical | Accuracy: 37.36 | 5 |
| Access Control | Medical | Accuracy: 100 | 5 |
| Machine Translation | Medical out-of-domain (test) | BLEU: 15.4 | 5 |
| Mixed Linear Regression | medical | Minimal Error (K=2): 0.1591 | 5 |
| Machine Translation | Medical multi-domain (test) | Decoding Throughput (Tok/Sec): 3,152.59 | 2 |
| DSL Evaluation | Medical | Opinion: 4.4 | 1 |