| Dataset Name | SOTA Method | Metric | Score | Trend | Updated |
|---|---|---|---|---|---|
| RealHitBench | DeepSeek-R1 | Exact Match (EM) | 70.31 | 66 | 3d ago |
| TaTQA | | Accuracy | 92.9 | 14 | 1mo ago |
| Countdown-4 | Self-Aware Markov Models | CD4 | 98.9 | 13 | 1mo ago |
| NSR-1K | DivCon | Precision | 85.41 | 8 | 1mo ago |
| HRS | DivCon | Precision | 78.65 | 8 | 1mo ago |
| DROP (dev) | POET-SQL_T5 | EM | 85.2 | 6 | 1mo ago |
| DROP (test) | CONE | EM | 83.74 | 4 | 1mo ago |
| SVAMP (test) | POET-SQL_T5 | Exact Match (EM) | 57.4 | 4 | 1mo ago |
| NUPA (aggregated) | NumValue-RNN | Exact Match | 72.4 | 4 | 1mo ago |
| DROP numerical reasoning Football 500 randomly sampled cases (test) | Least-to-Most | Accuracy | 63.4 | 4 | 1mo ago |
| DROP numerical reasoning Non-football 500 randomly sampled cases (test) | Least-to-Most | Accuracy | 74.2 | 4 | 1mo ago |
| EQUATE (test) | POET-SQL | Exact Match | 67.5 | 4 | 1mo ago |
| TAT-QA (dev) | POET-SQL | Exact Match (EM) | 59.1 | 4 | 1mo ago |
| HotpotQA (test) | POET-SQL | EM | 68.7 | 4 | 1mo ago |
| DROP span-subset (dev) | POET-SQL | EM | 79.8 | 4 | 1mo ago |