
General Capability Suite

Benchmarks

Task Name | Dataset Name | SOTA Result | Trend
General LLM Capability | General Capability Suite MMLU, AlpacaEval, GSM8K, MATH, HumanEval | MMLU: 71.86 | 56
General Capability Evaluation | General Capability Suite MMLU, GSM8K, HumanEval, IFEval | Common Average Score: 77.78 | 39
General Capability Evaluation | General Capability Suite ARC-C, HellaSwag, MMLU, GSM8K | ARC-C Accuracy: 54.27 | 27
General Knowledge Preservation | General Capability Suite HS WG IFEval MMLU | HS Delta: 17.7 | 22
General Language Capability Evaluation | General Capability Suite Aggregate | General Capability Avg. Accuracy: 62.51 | 18
Language Understanding and Reasoning | General Capability Suite (MMLU, TruthfulQA, HellaSwag, ARC-Easy) (test) | MMLU Score: 0.082 | 16
General Capability Evaluation | General Capability Suite | Average Score: 71 | 12
General Language Capability | General Capability Suite (MMLU, GSM8K, GPQA) | MMLU Accuracy: 73.6 | 5