| Dataset Name | SOTA Method | Metric | Value | Trend | Updated |
|---|---|---|---|---|---|
| TriviaQA | Task Calibration | BAS | 47 | 24 | 22d ago |
| SimpleQA | MBR | BAS | 0.2 | 24 | 22d ago |
| Unified Abstention Benchmark Suite (MMLU, GSM-MC, UMWP, Knowledge Crosswords, HellaSwag, Quail, Misconceptions, Propaganda, BBQ) | TRACE INVERSION | MMLU Accuracy | 91.5 | 24 | 2mo ago |
| SELFAWARE | Abstain-R1 | U-Ref | 91.4 | 7 | 1mo ago |
| ABSTAIN-QA | DeepSeek-R1 | Accuracy (A) | 83.4 | 7 | 1mo ago |
| ABSTAIN (test) | DeepSeek-R1 | Accuracy | 78.6 | 7 | 1mo ago |