MMLU, GSM8K, GPQA, HumanEval, TruthfulQA, IFEval

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Large Language Model Evaluation | MMLU, GSM8K, GPQA, HumanEval, TruthfulQA, IFEval | MMLU: 70.7 | 23 |
| General Language Capabilities | MMLU, GSM8K, GPQA, HumanEval, TruthfulQA, IFEval Aggregate | Average Score: 71.2 | 10 |