
Open LLM Leaderboard

Benchmarks

| Task Name | Dataset Name | Metric | SOTA Result | Trend |
| --- | --- | --- | --- | --- |
| Language Modeling and Reasoning | Open LLM Leaderboard | ARC | 82.8 | 33 |
| Open-style response generation | Open-LLM Leaderboard | Accuracy | 53.45 | 28 |
| Language Model Evaluation | Open LLM Leaderboard v2 (test) | BBH | 60.84 | 20 |
| Unified Multi-task Language Understanding and Instruction Following | Open LLM Leaderboard v1 (test) | MMLU-P Accuracy | 11.5 | 19 |
| Large Language Model Evaluation | Open LLM Leaderboard | Average Score | 74.2 | 19 |
| General Language Understanding and Reasoning | Open LLM Leaderboard Lighteval (test) | Mean Accuracy | 91.07 | 17 |
| Language Modeling | Open LLM Leaderboard & General Ability Benchmarks (MMLU-P, GPQA, BBH, MATH, MuSR, IFEval, ARC, Hellaswag, PIQA, BoolQ, WinoGrande, COPA, OpenBookQA, SciQ) unified (test) | MMLU-P Accuracy | 12 | 16 |
| Large Language Model Evaluation | Open LLM Leaderboard v1 (test) | Average Score | 69.6 | 14 |
| Language Modeling Evaluation | Open LLM Leaderboard | ARC | 70.22 | 14 |
| Natural Language Understanding | Open LLM Leaderboard (test) | ARC | 57.94 | 13 |
| General LLM Evaluation | Open LLM Leaderboard (test) | ARC-c | 78.92 | 12 |
| General Language Understanding | Open LLM Leaderboard (test) | ARC | 62.03 | 9 |
| Reasoning and Language Understanding | Open LLM Leaderboard MMLU-PRO, IFEval, BBH, GPQA, MATH, GSM8K, ARC v0.4.0 (test) | MMLU-PRO | 28.38 | 7 |
| General Language Understanding | Open LLM leaderboard | Average Score | 65.51 | 7 |
| Downstream Language Understanding | Open LLM Leaderboard zero-shot | ARCE | 52.9 | 6 |
| Unified Multi-task Language Understanding and Instruction Following | Open LLM Leaderboard | MMLU-P (Accuracy) | - | 0 |
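Several rows above report an "Average Score" or "Mean Accuracy" rather than a single benchmark metric. A minimal sketch of how such a column is commonly derived, assuming an unweighted mean over per-benchmark scores on a 0–100 scale (the actual leaderboard pipelines may normalize or weight tasks differently; the benchmark names and numbers below are illustrative, not real leaderboard entries):

```python
def average_score(scores: dict[str, float]) -> float:
    """Unweighted mean over per-benchmark scores (assumed 0-100 scale)."""
    if not scores:
        raise ValueError("no benchmark scores given")
    return sum(scores.values()) / len(scores)


# Illustrative per-benchmark results for one hypothetical model.
model_scores = {
    "ARC": 82.8,
    "HellaSwag": 88.0,
    "MMLU": 70.1,
    "TruthfulQA": 55.3,
    "WinoGrande": 81.5,
    "GSM8K": 66.9,
}

print(round(average_score(model_scores), 2))  # the single headline number
```

A model can thus rank highly on the aggregate column while trailing on individual tasks, which is why the table lists both per-metric and averaged entries.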