| Task Name | Dataset Name | SOTA Metric | SOTA Value | Trend |
|---|---|---|---|---|
| Commonsense Reasoning | CSQA | Accuracy | 96 | 366 |
| Commonsense Question Answering | CSQA (test) | Accuracy | 0.953 | 127 |
| Commonsense Reasoning | CSQA | CSQA Accuracy | 91.2 | 126 |
| Commonsense Reasoning | CSQA (test) | Accuracy | 89.4 | 111 |
| Hallucination Detection | CSQA | AUROC | 85.1 | 107 |
| Commonsense Question Answering | CSQA | Accuracy | 88.9 | 58 |
| Commonsense Question Answering | CSQA | Accuracy | 82.72 | 44 |
| Commonsense Reasoning | CSQA OOD (test) | Accuracy | 82.1 | 32 |
| Malicious Agent | CSQA | ASR@3 | 0.49 | 28 |
| Prompt Injection | CSQA | ASR@3 | 18.33 | 28 |
| Retrieval-augmented Reasoning | CSQA | Accuracy | 85.42 | 25 |
| Commonsense Reasoning | CSQA | Accuracy (CSQA) | 66.4 | 18 |
| Commonsense Question Answering | CSQA | PIQA | 84.06 | 18 |
| Question Answering | CSQA (test) | Accuracy | 78.5 | 18 |
| Hallucination Detection | CSQA (CommonsenseQA) | AUROC (128 steps) | 84.7 | 16 |
| Prompt Injection Defense | CSQA | ASR@3 | 13.4 | 16 |
| Commonsense Reasoning | CSQA (dev) | Accuracy | 85.42 | 16 |
| General Reasoning | CSQA | Accuracy | 81.3 | 15 |
| Simple Reasoning | CSQA | Accuracy | 91.75 | 15 |
| Reasoning | CSQA (leave-one-out setup) | Accuracy | 83.8 | 12 |
| Commonsense Question Answering | CSQA | Accuracy | 85.1 | 12 |
| Commonsense Reasoning | CSQA | Accuracy | 91.5 | 12 |
| Question Answering | CSQA (in-domain) | Accuracy | 83.78 | 12 |
| Commonsense Question Answering | CSQA (OOD) | Accuracy | 63.8 | 10 |
| Question Answering | CSQA | Accuracy | 70.8 | 10 |