| Dataset Name | SOTA Method | Metric | Value | Trend | Updated |
|---|---|---|---|---|---|
| ARC Easy | BERT-Judge | Accuracy | 99.7 | 257 | 1d ago |
| MMLU | M2CL | Accuracy | 97.5 | 210 | 12d ago |
| HellaSwag | Stable-LoRA | Accuracy | 93.59 | 196 | 17h ago |
| ARC Challenge | | Accuracy | 74.7 | 133 | 8d ago |
| MMLU-Pro | BERT-Judge | MMLU-Pro Overall Accuracy | 96.5 | 130 | 1mo ago |
| IndoCulture native prompts (test) | Gemma2-9B | Accuracy | 67.5 | 99 | 3mo ago |
| SciQ | QA | Accuracy | 100 | 91 | 8d ago |
| ArabCulture 1.0 (test) | Qwen2.5-7B | Accuracy | 59.6 | 84 | 3mo ago |
| OBQA | ETN | Accuracy | 93.2 | 79 | 1d ago |
| MMLU 5-shot | EdgeJury | Accuracy | 73.4 | 73 | 2mo ago |
| ARC-Easy (test) | TaT | Accuracy | 89.1 | 68 | 3mo ago |
| World Knowledge (average of OBQA, ARC-C, ARC-E, SciQ, SIQA) | NTF | Average Accuracy | 87.1 | 66 | 21h ago |
| RACE | QA | Accuracy | 98.24 | 64 | 1d ago |
| PIQA | | Accuracy | 80.5 | 63 | 1d ago |
| OpenBookQA (test) | Clean | Accuracy | 91 | 61 | 5d ago |
| ARC Challenge (test) | TaT | Accuracy | 82.17 | 57 | 1mo ago |
| MMLU | | MMLU Accuracy (Overall) | 74.91 | 52 | 14d ago |
| MMLU-PRO zero-shot | Gemini 2.0 Pro Exp | Accuracy | 84.29 | 51 | 12d ago |
| WinoG | | Accuracy | 68.9 | 48 | 8d ago |
| ARC Challenge | NoPE | Non-generative Accuracy | 30 | 48 | 1mo ago |
| MC (test) | | MC Avg | 78 | 46 | 2mo ago |
| BoolQ | | MC Accuracy | 0.887 | 46 | 4d ago |
| GPQA | | Accuracy (%) | 95.2 | 44 | 1mo ago |
| AQuA | AgentRevive | Accuracy | 89.45 | 43 | 14d ago |
| ConFiQA MC | ProbeRAG | F1 Score | 91.2 | 42 | 1mo ago |