| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Multi-path Speculative Decoding | Held-out (test) | Average Block Efficiency | 6.84 | 24 |
| Bargaining | Held-Out (test) | Reward | 0.7664 | 16 |
| Query routing and tool-calling accuracy evaluation | Held-out 12,282 examples (test) | Accuracy | 89.39 | 15 |
| Tone Mapping | Held-out (test) | PSNR | 40.59 | 6 |
| Clinical case generation | Held-out (test) | BLEU-4 | 18.98 | 6 |
| Selective Classification | Held-out (test) | Coverage | 100 | 5 |
| Pairwise preference ranking | Held-out | ELO Score | 1,187 | 5 |
| License Plate Recognition | held-out (test) | Plate Accuracy | 92.3 | 5 |
| Event-level market-impact prediction | Held-out 2021-2023 (test) | Non-neutral F1 | 35.6 | 4 |
| Binary-level classification | held-out (test) | Accuracy | 98.4 | 4 |
| binary classification | held-out n=2,332 (test) | Accuracy | 99.61 | 4 |
| Supply chain disruption forecasting | Held-out (test) | Brier Score | 0.0791 | 4 |
| Joint Attention detection | Held-out (test) | Accuracy | 77.6 | 3 |
| Knowledge Conflict Resolution | Held-out 30 q | Accuracy | 76.7 | 3 |