
LLM Evaluation Suite

Benchmarks

| Task Name | Dataset | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Language Modeling | LLM Evaluation Suite | Accuracy | 83.2 | 53 |
| Language Understanding and Reasoning | LLM Evaluation Suite (MMLU, GSM8k, HellaSwag, WinoGrande) | MMLU Score | 64.43 | 31 |
| Language Understanding and Reasoning | LLM Evaluation Suite (ARC-e, ARC-c, HellaSwag, OBQA, WinoGrande, MathQA, PIQA) | Average Accuracy | 64.01 | 19 |
| General Language Understanding and Reasoning | LLM Evaluation Suite (ARC, CSQA, GSM8K, HS, MMLU, OBQA, PIQA, SIQA, TQA, WG) | ARC | 45.9 | 14 |
| Zero-shot Language Understanding and Reasoning | LLM Evaluation Suite (MMLU, ARC-C, PIQA, WinoG, GSM8K, HellaSwag, GPQA, RACE), zero-shot, LLaDA1.5 | Average Score | 58.59 | 13 |
| Zero-shot Language Understanding and Reasoning | LLM Evaluation Suite (HellaSwag, MMLU, ARC-C, BoolQ, Lambada, ARC-E, HumanEval), zero-shot, Qwen3-30B-A3B | HellaSwag Accuracy | 79.8 | 12 |
| Model Merging | LLM Evaluation Suite | Normalized Score | 0.401 | 12 |
| Zero-shot Language Understanding | LLM Evaluation Suite (MMLU, GSM8k, HellaSwag, WinoGrande) | MMLU | 72.8 | 12 |
| Language Modeling and Reasoning | LLM Evaluation Suite (ARC, BBH, HellaSwag, TruthfulQA, LAMBADA, WinoGrande, GSM8K, MT-Bench) | ARC (Accuracy) | 54.61 | 3 |
Showing 9 of 9 rows
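Several rows above report an aggregate metric ("Average Accuracy", "Average Score") over a suite of datasets. A minimal sketch of how such a figure is usually computed, assuming an unweighted macro-average over per-dataset accuracies (the dataset names follow the third row of the table; the per-dataset scores here are hypothetical placeholders, not the actual SOTA sub-scores):

```python
# Hypothetical per-dataset accuracies for one model on the suite from row 3.
# These values are illustrative only; they are not the reported SOTA numbers.
scores = {
    "ARC-e": 70.0,
    "ARC-c": 48.0,
    "HellaSwag": 76.0,
    "OBQA": 44.0,
    "WinoGrande": 68.0,
    "MathQA": 38.0,
    "PIQA": 78.0,
}

# Unweighted macro-average: every dataset counts equally,
# regardless of how many examples it contains.
average_accuracy = sum(scores.values()) / len(scores)
print(f"Average Accuracy: {average_accuracy:.2f}")
```

Note that some leaderboards instead use a micro-average (weighting by example count) or normalize each score against a baseline before averaging; which convention a given row uses depends on the submitting paper.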