| ARC Challenge | GPT-4 | Accuracy 96.3 | | 749 | 2d ago |
| OpenBookQA | LMSI | Accuracy 94.4 | | 465 | 3d ago |
| ARC Easy | Mistral Small 24B Inst 2501 | Accuracy 98.2 | | 386 | 2d ago |
| ARC Easy | LFTF | Normalized Acc 96.4 | | 385 | 3d ago |
| SQuAD v1.1 (dev) | Megatron-3.9B ensemble | F1 Score 95.8 | | 375 | 2d ago |
| OBQA | Direct Fine-tuning | Accuracy 94.95 | | 276 | 2d ago |
| SQuAD v1.1 (test) | LUKE | F1 Score 95.4 | | 260 | 2d ago |
| GPQA | UPA | Accuracy 84.2 | | 258 | 3d ago |
| ARC-E | Direct Fine-tuning | Accuracy 95.23 | | 242 | 2d ago |
| BoolQ | PaLM 2-L | Accuracy 90.9 | | 240 | 2d ago |
| SciQ | GPT-NeoX | Accuracy 96 | | 226 | 3d ago |
| TriviaQA | RankCoT | Accuracy 86.68 | | 210 | 3d ago |
| SQuAD 2.0 | RoBERTa | F1 89.4 | | 190 | 3d ago |
| PopQA | LogicGaze | Accuracy 68.4 | | 186 | 3d ago |
| ARC-C | DRAG | Accuracy 94.1 | | 166 | 2d ago |
| SQuAD v2.0 (dev) | Megatron-3.9b | F1 91.2 | | 158 | 2d ago |
| ARC | Yi-34B + RTD | Accuracy 94.6 | | 154 | 3d ago |
| PubMedQA | Multi-Agent Medical Decision Consensus Matrix System | Accuracy 83.6 | | 145 | 3d ago |
| CommonsenseQA | Entropy Equilibrium Sampling (EES) | Accuracy 89.3 | | 143 | 3d ago |
| OpenBookQA (OBQA) (test) | KnowGPT | OBQA Accuracy 92.4 | | 130 | 3d ago |
| SQuAD | SMP-S | F1 89.8 | | 127 | 3d ago |
| CommonsenseQA (CSQA) | DeBERTaV3-large + KEAR | Accuracy 91.2 | | 124 | 3d ago |
| TriviaQA (test) | RankCoT | Accuracy 85.18 | | 121 | 3d ago |
| TriviaQA | PaLM 2-L | EM 86.1 | | 116 | 3d ago |
| HotpotQA | FlowSteer | F1 84.98 | | 114 | 3d ago |