| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Relative Robustness Analysis | Combined Past Tense, OR-Bench, MMLU | R-Score: 78.9 | 36 |
| Reasoning | Combined 37 Tasks (test) | Accuracy: 72.4 | 28 |
| Reasoning | Combined 107 Tasks (train) | Accuracy: 68.8 | 28 |
| Question Answering | Combined 7 Datasets | Average Score: 45 | 18 |
| Question Answering | Combined NQ, TriviaQA, PopQA, HotpotQA, 2WikiMQA, MuSiQue, Bamboogle | Total Score: 280.3 | 15 |
| All-in-One Image Restoration | Combined (Deraining, Desnowing, Dehazing) | PSNR: 34.02 | 13 |
| Bayesian neural network regression | Combined (test) | RMSE: 3.939 | 6 |
| Malicious Prompt Detection | Combined All Datasets (test) | ASR: 4.5 | 6 |
| Language Understanding and Reasoning | Combined (GSM8k, MATH500, MAWPS, SVAMP, AQuA, GLUE, CSQA, OBQA) | Average Score: 72.94 | 5 |
| Probabilistic Calibration | Combined 20K labeled samples | Brier Score: 0.0759 | 5 |
| Visualization Generation | Combined (C) | Compilation Success Rate: 100 | 4 |
| Data-to-text generation | Combined | FE: 8.05 | 3 |
| Shadow Detection | Combined Dataset | Testing Time (hours): 0.55 | 3 |
| Ranking Method Evaluation | Combined AIME'24 AIME'25 HMMT'25 BrUMO'25 | Mean Kendall's tau_b Correlation: 0.962 | 1 |
| Ranking Correlation Analysis | Combined AIME'24 AIME'25 HMMT'25 BrUMO'25 | Kendall's tau_b (vs Gold Standard): 0.865 | 1 |