| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Speech Emotion Recognition | Four OOD (test) | Macro-F1 Delta | 1.57 | 21 |
| Speculative decoding evaluation | OOD Mean | Speedup | 5.21 | 20 |
| Unsupervised Object Segmentation | OOD 1.0 (test) | FG-ARI | 7,824 | 16 |
| LLM Routing | OOD | Accuracy | 89 | 11 |
| OOD Detection | OOD | AUC (Confidence) | 0.822 | 9 |
| Language Modeling | OOD | Loss | 1.285 | 7 |
| Classification | OOD | Accuracy | 65.71 | 6 |
| Speculative Decoding | OOD | Block Efficiency | 2.13 | 5 |
| Defective Dialog Detection | OOD Shopping n = 105 (test) | Precision | 48 | 5 |
| Unsupervised image annotation | OOD set | NMI | 0.54 | 5 |
| Referential Communication | OOD set | Accuracy | 92.7 | 5 |
| Image Denoising | OOD Average | PSNR | 39.94 | 4 |
| MR Image Quality Transfer | OOD | SSIM | 82.19 | 4 |
| STL-conditioned Robotic Planning | OOD-3 Layout | Success Rate (OOD-3 All) | 23 | 4 |
| STL-conditioned Robotic Planning | OOD-2 Layout | OOD-2 Success Rate (All) | 12.88 | 4 |
| Open-ended Dialogue | OOD Average | Win Rate | 60.5 | 4 |
| Table Understanding | OOD Table S2 (test) | ROUGE-L | 40.38 | 4 |
| Table Understanding | OOD Table S1 (test) | Accuracy | 80.2 | 4 |
| Mapless Navigation | OOD Physical Track (a) 1/10th scale (test) | Lap Time (s) | 9.56 | 3 |
| Binary classification (Human vs Machine speech) | Overall OOD (test) | Accuracy | 97.4 | 1 |