| Dataset Name | SOTA Method | Metric | Score | Trend | Updated |
|---|---|---|---|---|---|
| We-Math (test) | GPT-4o | S1 Score | 72.8 | 20 | 3mo ago |
| GSM8K (test) | TCP-MCP | Accuracy | 96.61 | 14 | 6d ago |
| MMLU-Pro (held-out test) | LLM-Debate | Accuracy | 83.67 | 14 | 6d ago |
| MMLU (test) | LLM-Debate | Accuracy | 90.84 | 14 | 6d ago |
| DART 5 | Berr. Latent | Accuracy | 54 | 5 | 2mo ago |