| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Relative Robustness Analysis | Combined Past Tense, OR-Bench, MMLU | R-Score: 78.9 | 36 |
| Reasoning | Combined 37 Tasks (test) | Accuracy: 72.4 | 28 |
| Reasoning | Combined 107 Tasks (train) | Accuracy: 68.8 | 28 |
| Question Answering | Combined 7 Datasets | Average Score: 45 | 18 |
| Harmful prompt detection | Combined Average | F1 Score (Combined Average): 90.18 | 17 |
| Question Answering | Combined NQ, TriviaQA, PopQA, HotpotQA, 2WikiMQA, MuSiQue, Bamboogle | Total Score: 280.3 | 15 |
| All-in-One Image Restoration | Combined (Deraining, Desnowing, Dehazing) | PSNR: 34.02 | 13 |
| Subject-Level Detection | Combined (ADFTD, BrainLat, AD-Auditory, ADFSU, APAVA) | Accuracy: 78.23 | 12 |
| Segment-Level Classification | Combined ADFTD, BrainLat, AD-Auditory, ADFSU, APAVA | Accuracy: 68.03 | 12 |
| Bayesian neural network regression | Combined (test) | RMSE: 3.925 | 12 |
| General Video Understanding | Combined (VideoMME, LVBench, LongVideoBench, EgoSchema, MLVU) | Average Score: 64.9 | 11 |
| Negative Concept Suppression | Combined LLM-generated + COCO-derived | Suppression Rate: 85.25 | 10 |
| Multi-turn attack detection | Combined LMSYS, SafeDialBench, Synthetic (held-out) | Detection Accuracy: 99 | 10 |
| Malicious Prompt Detection | Combined All Datasets (test) | ASR: 4.5 | 6 |
| Language Understanding and Reasoning | Combined (GSM8k, MATH500, MAWPS, SVAMP, AQuA, GLUE, CSQA, OBQA) | Average Score: 72.94 | 5 |
| Probabilistic Calibration | Combined 20K labeled samples | Brier Score: 0.0759 | 5 |
| Classification | Combined (label-stratified) | AUROC: 0.986 | 4 |
| Zero-shot Language Understanding | Combined Zero-shot | Average Accuracy: 63.71 | 4 |
| Visualization Generation | Combined (C) | Compilation Success Rate: 100 | 4 |
| Brain Tumor Classification | Combined 4 datasets | Accuracy: 99.5 | 3 |
| Data-to-text generation | Combined | FE: 8.05 | 3 |
| Shadow Detection | Combined Dataset | Testing Time (hours): 0.55 | 3 |
| Landmark Detection | Combined | MRE: 1.02 | 2 |
| Attack Detection | Combined (label-stratified) | AUROC: 97.1 | 1 |
| Ranking Method Evaluation | Combined AIME'24 AIME'25 HMMT'25 BrUMO'25 | Mean Kendall's tau_b Correlation: 0.962 | 1 |