| Dataset Name | SOTA Method | Metric | Value | Trend | Updated |
|---|---|---|---|---|---|
| TydiQA | ByT5 | Accuracy | 81.9 | 65 | 2mo ago |
| TyDiQA 1-shot macro-averaged | Filter-then-Weight | F1 Score (1-shot macro) | 48.86 | 28 | 2mo ago |
| TyDiQA GoldP (val) | ByT5 XXL | Ar Score | 80 | 20 | 1mo ago |
| mGPQA | self-cons (ours) | Accuracy | 32.9 | 12 | 1d ago |
| M3-Exam (test) | EMCEE (Ours) | Accuracy (All) | 77.4 | 10 | 2d ago |
| M-ARC | OrthoMerge | Accuracy | 44.75 | 10 | 3mo ago |
| Fed-Aya (test) | FedBAT | AR Score | 2.8 | 6 | 1mo ago |
| TydiQA | Forgetting | F1 Score | 48.71 | 4 | 2mo ago |
| MultiLoKo 31 languages (test) | Qwen3-30B-A3B | Overall Score | 26 | 4 | 2mo ago |
| ECleKTic 12 languages (test) | GPT-OSS-120B | Overall Score | 21.6 | 4 | 2mo ago |
| xquad vi | ACE | Normalized Performance | 89.29 | 3 | 3mo ago |
| xquad zh | ACE | Normalized Performance | 60.28 | 3 | 3mo ago |
| Speech-XBelebele Text -> Text | Spectrum | Accuracy | 63.64 | 1 | 2mo ago |