| Dataset | SOTA Method | Metric | Value | Trend | Updated |
|---|---|---|---|---|---|
| MMLU | M2CL | Accuracy | 97.5 | 148 | 3d ago |
| ARC Easy | PAROQ | Accuracy | 84.3 | 122 | 3d ago |
| MMLU-Pro | TopoDIM | Overall Accuracy | 84.8 | 116 | 3d ago |
| ARC Challenge | | Accuracy | 74.7 | 106 | 3d ago |
| IndoCulture native prompts (test) | Gemma2-9B | Accuracy | 67.5 | 99 | 3d ago |
| ArabCulture 1.0 (test) | Qwen2.5-7B | Accuracy | 59.6 | 84 | 3d ago |
| SciQ | QA | Accuracy | 100 | 74 | 3d ago |
| OBQA | QAP | Accuracy | 87.74 | 61 | 3d ago |
| HellaSwag | LLaMA-3 8B | Accuracy | 79.19 | 59 | 3d ago |
| ARC-Easy (test) | SAES-SVD | Accuracy | 71.2 | 50 | 3d ago |
| RACE | QA | Accuracy | 98.24 | 46 | 3d ago |
| MMLU 5-shot | EdgeJury | Accuracy | 73.4 | 45 | 3d ago |
| ARC Challenge | | Non-generative Accuracy | 0.6451 | 36 | 3d ago |
| Bangla MMLU 1.0 (test) | Qwen-2.5-1.5b | Accuracy | 35 | 33 | 3d ago |
| TruthfulQA MC1 | EdgeJury | MC1 Accuracy | 76.2 | 33 | 3d ago |
| Average (OBQA, ARC, Riddle, PQA) | Llama-3.1-8B-Instruct | Average Accuracy | 68.31 | 31 | 3d ago |
| RiddleSense | Similarity-based Router | Accuracy | 70.59 | 31 | 3d ago |
| AQuA | GPT-4o + QuaSAR | Accuracy | 87.4 | 31 | 3d ago |
| MedQA 5 opts | GPT-4o | Accuracy | 87 | 26 | 3d ago |
| ARC Challenge (test) | UnifiedQA_T5-FT | Accuracy | 54.42 | 26 | 3d ago |
| Image Implication Multiple-Choice 1.0 (test) | | Accuracy | 78 | 25 | 3d ago |
| MedQA | QAP | Accuracy | 44.01 | 24 | 3d ago |
| Riddle | QAP | Accuracy | 76.62 | 24 | 3d ago |
| MMLU Medical and Biological Sub-tasks | GPT-4 (Medprompt) | Clinical Knowledge Accuracy | 95.8 | 24 | 3d ago |
| DREAM | QA+C | Accuracy | 98.77 | 22 | 3d ago |