| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Scientific Reasoning | TheoremQA | Accuracy 82.49 | 68 |
| Mathematical Reasoning | TheoremQA | Accuracy 49.25 | 64 |
| Theorem-based Reasoning | TheoremQA | Score 53 | 34 |
| Reasoning Quality Assessment | TheoremQA | AUROC 0.873 | 32 |
| Science and Engineering Question Answering | TheoremQA | Accuracy 68.04 | 31 |
| Physics | TheoremQA | Accuracy 58.8 | 28 |
| Mathematical Reasoning | TheoremQA (test) | Accuracy 48.4 | 28 |
| Targeted error generation | TheoremQA Tier-1 (first-20 sweep) | Targeted Error Rate 54 | 27 |
| Mathematical Reasoning | TheoremQA | Pass@1 24.7 | 18 |
| Mathematical Reasoning | TheoremQA | Pass@1 34.1 | 18 |
| STEM Reasoning | TheoremQA | Avg@2 55.4 | 16 |
| Question Answering | TheoremQA | Accuracy 15 | 16 |
| Mathematical Reasoning | TheoremQA | ThmQA Score 57.88 | 15 |
| Reasoning | TheoremQA | AUROC 88.87 | 14 |
| Theorem-based Question Answering | TheoremQA | Accuracy 86.32 | 13 |
| Theorem Proving | TheoremQA | Accuracy 13.5 | 13 |
| STEM Theorem Question Answering | TheoremQA | Acceptance Length 4.4 | 12 |
| Mathematical Problem Solving | TheoremQA TQ-Math | Exact Match Accuracy 57.7 | 12 |
| Retrieval-Augmented Generation | TheoremQA | Accuracy 66.3 | 12 |
| Theorem Question Answering | TheoremQA standard (test) | Accuracy 56 | 12 |
| Skill retrieval | TheoremQA | nDCG@1 77.4 | 11 |
| Coding | TheoremQA | Accuracy 55.38 | 10 |
| General Reasoning | TheoremQA | Accuracy (General Reasoning) 32.47 | 9 |
| Scientific Reasoning | TheoremQA (test) | Accuracy 48.4 | 9 |
| Theorem Question Answering | TheoremQA (test) | Accuracy 87.4 | 8 |