| Dataset Name | SOTA Method | Metric | Trend | | |
|---|---|---|---|---|---|
| BENCH-PROXY (MMLU, ANLI, HellaSwag, PIQA, SIQA, W.G., ARC-E, ARC-C, C.QA, WSC) (test) | | MMLU 34.32 | | 24 | 1mo ago |
| LLM Evaluation Suite (MMLU, ARC-C, PIQA, WinoG, GSM8K, HellaSwag, GPQA, RACE) zero-shot | LLaDA1.5 | Average Score 58.59 | | 13 | 1mo ago |
| LLM Evaluation Suite (HellaSwag, MMLU, ARC-C, BoolQ, Lambada, ARC-E, HumanEval) zero-shot | Qwen3-30B-A3B | HellaSwag Accuracy 79.8 | | 12 | 17d ago |
| nine-benchmark suite (MMLU, ARC, CSQA, HellaSwag, OpenBookQA, PIQA, SocialIQA, WinoGrande) (test val) | FLUX | MMLU Accuracy 31.7 | | 6 | 1mo ago |