| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Question Answering | GPQA | Accuracy 84.2 | 258 |
| Science Reasoning | GPQA | Accuracy 95.1 | 218 |
| Graduate-level Question Answering | GPQA | Accuracy 96.9 | 114 |
| Reasoning | GPQA Diamond | Accuracy 91.9 | 88 |
| Science Question Answering | GPQA | pass@1 Accuracy 87.6 | 85 |
| Scientific Question Answering | GPQA Diamond | Accuracy 84.4 | 64 |
| Closed-ended reasoning | GPQA Diamond (test) | Accuracy 65.9 | 63 |
| Question Answering | GPQA Diamond | Accuracy 66.7 | 62 |
| Scientific Reasoning | GPQA | Accuracy 83.8 | 55 |
| Question Answering | GPQA (test) | Accuracy 45.5 | 55 |
| Scientific Reasoning | GPQA | Accuracy 45.5 | 50 |
| Question Answering | GPQA Diamond | Pass@1 75.7 | 49 |
| Scientific Reasoning | GPQA Diamond | Accuracy 87.5 | 45 |
| Science Reasoning | GPQA (test) | Accuracy 64.44 | 41 |
| Expert-level Question Answering | GPQA Diamond | Pass@1 66.65 | 39 |
| Reasoning | GPQA | Accuracy 55.05 | 38 |
| Science Reasoning | GPQA | Pass@1 69.7 | 35 |
| Knowledge | GPQA | Accuracy 59.39 | 34 |
| Question Answering | GPQA Diamond v1 (test) | Avg@5 86.7 | 32 |
| Scientific Reasoning | GPQA Diamond (test) | Accuracy 99.37 | 32 |
| Scientific Reasoning | GPQA Diamond | Pass@1 64.58 | 32 |
| Reasoning | GPQA-D | Accuracy 59.47 | 29 |
| Graduate-Level Reasoning | GPQA | Accuracy 78.9 | 29 |
| Science Question Answering | GPQA | Accuracy 34.8 | 28 |
| Scientific QA | GPQA-Diamond | Final Accuracy 99.3 | 28 |