| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Text-to-SQL | Science Benchmark | Execution Accuracy: 59.53 | 28 |
| Named Entity Recognition | Science | F1 Score: 79.4 | 19 |
| Task Routing | Science | Cost ($): 0.0276 | 15 |
| Taxonomy Expansion | Science single-parent hierarchies (test) | R@1: 46.1 | 13 |
| Multi-label Classification | Science | Ranking Loss: 0.9277 | 11 |
| Multi-label Feature Selection | Science | Macro F1 Score: 12.39 | 11 |
| Multi-label Feature Selection | Science | CV Score: 25.122 | 11 |
| Multi-label Feature Selection | Science (test) | HL: 3.44 | 11 |
| Multi-label Feature Selection | Science | OE Score: 96 | 11 |
| Multi-label Feature Selection | Science | AP: 5.26 | 11 |
| Taxonomy Expansion | Science (SCI) SemEval-2016 Task 13 | Chi-Squared: 13.2 | 10 |
| Scientific Reasoning | Science GPQA Diamond HLE (test) | GPQA Diamond Score: 63.1 | 6 |
| Science Reasoning | Science (out-of-distribution) | Accuracy: 65.12 | 6 |
| Task-Efficient Routing | Science Curated Task Benchmark 1.0 (test) | Average Cost: 0.0054 | 3 |
| Taxonomy Expansion | Science | Prec@1: 44.7 | 3 |
| Named Entity Recognition | Science English | F1 Score: 62.29 | 2 |