| Task Name | Dataset Name | SOTA Result | Trend | |
|---|---|---|---|---|
| Out-of-Domain Reasoning Aggregation | OOD Average | Accuracy 63.57 | 22 | |
| Speech Emotion Recognition | Four OOD (test) | Macro-F1 Delta 1.57 | 21 | |
| Speculative decoding evaluation | OOD Mean | Speedup 5.21 | 20 | |
| Out-of-Distribution Detection | OOD datasets | pAUROC@20 94.2 | 17 | |
| Unsupervised Object Segmentation | OOD 1.0 (test) | FG-ARI 7,824 | 16 | |
| LLM Routing | OOD | Accuracy 89 | 11 | |
| OOD Detection | OOD | AUC (Confidence) 0.822 | 9 | |
| Mathematical and Scientific Reasoning | OOD AIME, HMMT, GPQA, MMLU-Pro, MMLU-Redux 2.0 | Pass@1 89.5 | 8 | |
| Language Modeling | OOD | Loss 1.285 | 7 | |
| Diffusion-generated time series detection | Avg. OOD Aggregate of TSDiff, Diffusion-TS, WaveStitch (summary) | F1 Score 84.8 | 6 | |
| Detoxification | OOD | TP Score 54 | 6 | |
| Classification | OOD | Accuracy 65.71 | 6 | |
| Speculative Decoding | OOD | Block Efficiency 2.13 | 5 | |
| Defective Dialog Detection | OOD Shopping n = 105 (test) | Precision 48 | 5 | |
| Unsupervised image annotation | OOD set | NMI 0.54 | 5 | |
| Referential Communication | OOD set | Accuracy 92.7 | 5 | |
| Safe Robot Navigation | OOD Case II: high obstacle density (30 obstacles) | SR 44.2 | 4 | |
| Image Denoising | OOD Average | PSNR 39.94 | 4 | |
| MR Image Quality Transfer | OOD | SSIM 82.19 | 4 | |
| STL-conditioned Robotic Planning | OOD-3 Layout | Success Rate (OOD-3 All) 23 | 4 | |
| STL-conditioned Robotic Planning | OOD-2 Layout | OOD-2 Success Rate (All) 12.88 | 4 | |
| Open-ended Dialogue | OOD Average | Win Rate 60.5 | 4 | |
| Table Understanding | OOD Table S2 (test) | ROUGE-L 40.38 | 4 | |
| Table Understanding | OOD Table S1 (test) | Accuracy 80.2 | 4 | |
| Synthetic Face Detection | OOD (Out-of-Distribution) | ECE 0.0516 | 3 | |