| Task Name | Dataset Name | SOTA Result | Trend | |
|---|---|---|---|---|
| Question Answering | AVG. | EM 47.5 | 28 | |
| Change Detection | Avg across SYSU, LEVIR, GVLM, CLCD, OSCD | Precision 84.8 | 23 | |
| Low-Light Image Enhancement | AVG. DICM, MEF, LIME, NPE, VV | NIQE 3.589 | 17 | |
| Reasoning Performance (Aggregate) | AVG | TPF 351 | 14 | |
| Question Answering | AVG. Aggregate of NQ, TQA, HQA, 2WIKI (test) | EM 42.5 | 14 | |
| Selective Classification | Avg (all) | AURC (10^-2 scale) 0.215 | 11 | |
| Selective Classification | Avg 1K | AURC (10^-2 scale) 0.248 | 11 | |
| Detection | AVG | AUC 0.897 | 10 | |
| Multi-task Language Understanding | AVG Across All Benchmarks | Throughput 12.89 | 8 | |
| Scene Text Recognition | AVG 12 benchmarks | Word Accuracy 91.33 | 8 | |
| Machine Translation | Avg (News, Flores, Subtitle, Travel) German-English (test aggregate) | DA 87.77 | 4 | |