| Dataset Name | SOTA Method | Metric | Trend | | |
|---|---|---|---|---|---|
| ARC Challenge | GPT-4 | Accuracy 96.3 | 906 | 16d ago | |
| ARC Easy | Mistral Small 24B Inst 2501 | Accuracy 98.2 | 597 | 2d ago | |
| OpenBookQA | LMSI | Accuracy 94.4 | 465 | 1mo ago | |
| ARC-E | Direct Fine-tuning | Accuracy 95.23 | 416 | 4d ago | |
| ARC Easy | LFTF | Normalized Acc 96.4 | 389 | 1mo ago | |
| SQuAD v1.1 (dev) | Megatron-3.9B ensemble | F1 Score 95.8 | 380 | 12d ago | |
| PIQA | Mashup Learning | Accuracy 86.5 | 374 | 4d ago | |
| BoolQ | PaLM 2-L | Accuracy 90.9 | 317 | 2d ago | |
| OBQA | Direct Fine-tuning | Accuracy 94.95 | 300 | 1mo ago | |
| SciQ | MSSRfull | Accuracy 97.2 | 283 | 10d ago | |
| SQuAD v1.1 (test) | LUKE | F1 Score 95.4 | 260 | 1mo ago | |
| GPQA | UPA | Accuracy 84.2 | 258 | 1mo ago | |
| TriviaQA | RankCoT | Accuracy 86.68 | 238 | 19d ago | |
| ARC | Yi-34B + RTD | Accuracy 94.6 | 230 | 1mo ago | |
| ARC-C | DRAG | Accuracy 94.1 | 192 | 23d ago | |
| SQuAD 2.0 | RoBERTa | F1 89.4 | 190 | 4d ago | |
| PopQA | LogicGaze | Accuracy 68.4 | 186 | 1mo ago | |
| TriviaQA | PaLM 2-L | EM 86.1 | 182 | 11d ago | |
| SQuAD v2.0 (dev) | Megatron-3.9b | F1 91.2 | 163 | 12d ago | |
| 2WIKI | | F1 91.79 | 152 | 10d ago | |
| TruthfulQA | LLaMA-3.1-8B | Accuracy 86.6 | 152 | 4d ago | |
| CommonsenseQA | Entropy Equilibrium Sampling (EES) | Accuracy 89.3 | 148 | 1mo ago | |
| PubMedQA | Multi-Agent Medical Decision Consensus Matrix System | Accuracy 83.6 | 145 | 1mo ago | |
| ARC Challenge | Frozen LLM graph | Accuracy (ARC) 87.3 | 142 | 2d ago | |
| SQuAD | SMP-S | F1 89.8 | 134 | 4d ago | |