| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Jailbreak Defense | Hard (H) | FPR | 0 | 12 |
| Classification | HARD (test) | Accuracy | 97.77 | 8 |
| Online Learning | HARD | Latency (s) | 0.2516 | 8 |
| RO reformulation | Hard (Out-of-Distribution) | Accuracy | 94.8 | 6 |
| Speech Separation | Hard (test) | SI-SDR (dB) | 9.31 | 4 |
| Reasoning over Large Structured Context | Hard | ReasoningJudge Score | 5 | 4 |
| Joint Audio-Video Generation | Hard (test) | Sync-C | 6.12 | 4 |
| Online Bin Packing | Hard28-R | Gap Percentage | 8.06 | 4 |
| First Integral Discovery | Hard (test) | Accuracy | 63.7 | 2 |