| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Commonsense Reasoning | CSQA | Accuracy 96 | 366 |
| Commonsense Question Answering | CSQA (test) | Accuracy 0.953 | 127 |
| Commonsense Reasoning | CSQA (test) | Accuracy 89.4 | 111 |
| Hallucination Detection | CSQA | AUROC 72.47 | 55 |
| Commonsense Question Answering | CSQA | Accuracy 82.72 | 44 |
| Malicious Agent | CSQA | ASR@3 0.49 | 28 |
| Prompt Injection | CSQA | ASR@3 18.33 | 28 |
| Retrieval-augmented Reasoning | CSQA | Accuracy 85.42 | 25 |
| Commonsense Reasoning | CSQA | CSQA Accuracy 91.2 | 21 |
| Commonsense Question Answering | CSQA | PIQA 84.06 | 18 |
| Question Answering | CSQA (test) | Accuracy 78.5 | 18 |
| Prompt Injection Defense | CSQA | ASR@3 13.4 | 16 |
| Commonsense Reasoning | CSQA (dev) | Accuracy 85.42 | 16 |
| Simple Reasoning | CSQA | Accuracy 91.75 | 15 |
| Commonsense Question Answering | CSQA | Accuracy 85.1 | 12 |
| Commonsense Reasoning | CSQA | Accuracy 91.5 | 12 |
| Question Answering | CSQA (in-domain) | Accuracy 83.78 | 12 |
| Commonsense Question Answering | CSQA (OOD) | Accuracy 63.8 | 10 |
| Multiple Choice Question Answering | CSQA (dev) | Accuracy 71.1 | 10 |
| Ranking correlation with full dataset evaluation | CSQA | Kendall Correlation 0.83 | 10 |
| Commonsense Reasoning | CSQA | PIQA 84.98 | 9 |
| Question Answering | CSQA | µbias 0.9 | 8 |
| Multiple Choice Question Answering | CSQA (test) | Accuracy 82.2 | 8 |
| Scaling Law Prediction | CSQA | MAE 0.0255 | 7 |
| Question Answering | CSQA | Accuracy 69.2 | 7 |