| Dataset Name | SOTA Method | Metric | Value | Trend | Updated |
|---|---|---|---|---|---|
| Seen vs. Unseen Datasets (dataset-level) | CoDeC | AUC | 100 | 56 | 21d ago |
| SAT | | F1 Score | 79 | 16 | 2mo ago |
| K&K | Min-K% | F1 Score | 70 | 16 | 2mo ago |
| AIME 2025 | Self-Critique | F1 Score | 76 | 16 | 2mo ago |
| AIME 2024 | Self-Critique | F1 Score | 76 | 16 | 2mo ago |
| MATH-500 (test) | | Reference Score S | 42.6 | 12 | 12d ago |
| GSM8K (test) | | S_ref | 53.68 | 12 | 12d ago |
| Omni-MATH Dataset C | ZCP | Score (Reference) | 23.22 | 8 | 12d ago |
| DETCON Logical Reasoning | CDD | Accuracy | 70.6 | 7 | 3mo ago |
| DETCON Code Generation | CDD | Accuracy | 71.5 | 7 | 3mo ago |
| Multi-domain Data Dataset U | ZCP | Reference Score S | 16.38 | 4 | 12d ago |
| Omni-MATH (Dataset U) | ZCP | Reference Score (S) | 15.85 | 4 | 12d ago |
| Titanic | - | - | - | 0 | 2mo ago |
| synthetic | - | - | - | 0 | 2mo ago |
| mushroom | - | - | - | 0 | 2mo ago |
| iris | - | - | - | 0 | 2mo ago |
| gamma | - | - | - | 0 | 2mo ago |