| RealHitBench | DeepSeek-R1 | Exact Match (EM): 70.31 | | 66 | 1mo ago |
| Credit | NumCoKE | Hit Rate @ 1: 74.5 | | 24 | 1mo ago |
| Spotify | NumCoKE | Hit Rate @ 1: 63.6 | | 24 | 1mo ago |
| US-Cities | NumCoKE | H@1: 43.9 | | 24 | 1mo ago |
| TaTQA | | Accuracy: 92.9 | | 14 | 2mo ago |
| TableBench (test) | Qwen3-8B | Accuracy: 64.48 | | 13 | 13d ago |
| Countdown-4 | Self-Aware Markov Models | CD4: 98.9 | | 13 | 2mo ago |
| NSR-1K | DivCon | Precision: 85.41 | | 8 | 2mo ago |
| HRS | DivCon | Precision: 78.65 | | 8 | 2mo ago |
| GSM8K (test) | DEL | Accuracy (Error <= 1): 70 | | 6 | 13d ago |
| GSM8K (test) | EMO | MAE (Scale 1): 1.13 | | 6 | 13d ago |
| DROP (dev) | POET-SQL_T5 | EM: 85.2 | | 6 | 2mo ago |
| DROP (test) | CONE | EM: 83.74 | | 4 | 2mo ago |
| SVAMP (test) | POET-SQL_T5 | Exact Match (EM): 57.4 | | 4 | 3mo ago |
| NUPA (aggregated) | NumValue-RNN | Exact Match: 72.4 | | 4 | 3mo ago |
| DROP numerical reasoning Football 500 randomly sampled cases (test) | Least-to-Most | Accuracy: 63.4 | | 4 | 3mo ago |
| DROP numerical reasoning Non-football 500 randomly sampled cases (test) | Least-to-Most | Accuracy: 74.2 | | 4 | 3mo ago |
| EQUATE (test) | POET-SQL | Exact Match: 67.5 | | 4 | 3mo ago |
| TAT-QA (dev) | POET-SQL | Exact Match (EM): 59.1 | | 4 | 3mo ago |
| HotpotQA (test) | POET-SQL | EM: 68.7 | | 4 | 3mo ago |
| DROP span-subset (dev) | POET-SQL | EM: 79.8 | | 4 | 3mo ago |