
LLM Evaluation Suite

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Language Understanding and Reasoning | LLM Evaluation Suite (ARC-e, ARC-c, HellaSwag, OBQA, WinoGrande, MathQA, PIQA) | Average Accuracy: 64.01 | 19 |
| General Language Understanding and Reasoning | LLM Evaluation Suite (ARC, CSQA, GSM8K, HS, MMLU, OBQA, PIQA, SIQA, TQA, WG) | ARC: 45.9 | 14 |
| Zero-shot Language Understanding and Reasoning | LLM Evaluation Suite (MMLU, ARC-C, PIQA, WinoG, GSM8K, HellaSwag, GPQA, RACE), zero-shot, LLaDA1.5 | Average Score: 58.59 | 13 |
| Language Modeling and Reasoning | LLM Evaluation Suite (ARC, BBH, HellaSwag, TruthfulQA, LAMBADA, WinoGrande, GSM8K, MT-Bench) | ARC (Accuracy): 54.61 | 3 |
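Several rows report an "Average Accuracy" or "Average Score" over a suite of datasets. A minimal sketch of how such a column is typically computed, assuming an unweighted macro-average over per-dataset accuracies (the dataset names follow the first benchmark row; the individual scores are illustrative placeholders, not actual reported results):

```python
def macro_average(scores: dict[str, float]) -> float:
    """Unweighted mean of per-dataset accuracy scores (in percent)."""
    return sum(scores.values()) / len(scores)

# Illustrative per-dataset accuracies for one hypothetical model.
example_scores = {
    "ARC-e": 70.0,
    "ARC-c": 45.0,
    "HellaSwag": 60.0,
    "OBQA": 55.0,
    "WinoGrande": 65.0,
    "MathQA": 40.0,
    "PIQA": 75.0,
}

print(round(macro_average(example_scores), 2))  # → 58.57
```

Note that a macro-average weights every dataset equally regardless of its size; leaderboards that instead pool all examples (a micro-average) can report noticeably different numbers for the same model.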