| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Question Answering | Medical | GPT Accuracy | 68.81 | 31 |
| Language Modeling | Medical (Med) | PPL Change (%) vs Baseline | 0.1 | 30 |
| Partial Multi-label Learning | medical | Average Precision | 87.5 | 21 |
| Partial Multi-label Learning | medical | Ranking Loss | 0.03 | 21 |
| Rubric satisfaction evaluation | Medical | Claude-4 Sonnet Score | 50.9 | 21 |
| Hypernym discovery | medical Gold standard domain-specific (test) | MRR | 77.32 | 18 |
| Medical Task | Medical | Accuracy | 100 | 16 |
| Image Quality Assessment | Medical | PLCC | 0.871 | 15 |
| Summarization | Medical Random subset | R-LCS | 25.04 | 14 |
| Medical Question Answering | Medical | Score | 81.55 | 14 |
| Preference Evaluation | Medical | Avg Score | 8.58 | 14 |
| Budgeted Hybrid Routing | Medical Average Global | Spearman Correlation | 1 | 12 |
| Budgeted Hybrid Routing | Medical Ru→En | HitRate@p | 100 | 12 |
| Budgeted Hybrid Routing | Medical Zh→En | HitRate@p | 100 | 12 |
| Budgeted Hybrid Routing | Medical En→Ru | HitRate@p | 100 | 12 |
| Budgeted Hybrid Routing | Medical En→Zh | HitRate@p | 100 | 12 |
| Retrieval-Augmented Generation | Medical | Indexing Time (minutes) | 7 | 11 |
| Multi-label classification | Medical | Micro F1-Score | 76.6 | 11 |
| Classification | Medical | F1 Score | 76.6 | 10 |
| Importance-based Node Leakage | Medical | Leakage (Deg) | 36.2 | 10 |
| Factual Precision Evaluation | Medical | SAFE | 87.3 | 10 |
| Machine Translation | Medical (test) | BLEU | 55.42 | 9 |
| Summarization | Medical (OOV_SD) | R-LCS | 26.68 | 8 |
| Multi-label Feature Selection | medical | Hamming Loss | 0.011 | 7 |
| Multi-label Feature Selection | medical | Running Time (sec) | 0.11 | 7 |