| Task Name | Dataset Name | SOTA Result | Trend | |
|---|---|---|---|---|
| General Language Evaluation | Aggregated MMLU, BoolQ, OpenBookQA, RTE | Average Accuracy: 70.4 | 22 | |
| Feature Selection | Aggregated AL, CH, CO, EY, GE, HE, HI, HO, JA, MI, OT, YE | Rank: 2.17 | 17 | |
| General Language Proficiency | Aggregated GSM8K, TruthfulQA, TriviaQA, CNN/DM, MMLU | Average Score: 48.6 | 9 | |
| General Performance | Aggregated MMLU, HellaSwag, TruthfulQA, GSM8K, MATH, MBPP, HumanEval | Average Score: 40.35 | 9 | |
| Disentanglement | Aggregated | InfoM: 0.76 | 8 | |
| Faithfulness Diagnosticity | Aggregated SST, Ev.Inf, AG, and M.RC | Alpha Score: 0.525 | 4 | |
| Instance-level search | Aggregated Mean All & Mean R1M (test) | Mean All: 0.601 | 2 | |