
Open LLM Leaderboard

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Leaderboard Evaluation | Open LLM Leaderboard 2 | Overall Score: 51.28 | 55 |
| General Language Understanding and Reasoning | Open LLM Leaderboard Population (Top-50) | Accuracy: 60.08 | 50 |
| Leaderboard Evaluation | Open LLM Leaderboard 1 | Overall Score: 69.28 | 46 |
| Large Language Model Evaluation | Open LLM Leaderboard | Average Score: 74.2 | 41 |
| Language Modeling and Reasoning | Open LLM Leaderboard | ARC: 82.8 | 33 |
| Open-style response generation | Open-LLM Leaderboard | Accuracy: 53.45 | 28 |
| General LLM Evaluation | Open LLM Leaderboard (test) | Average Score: 70.1 | 21 |
| Language Model Evaluation | Open LLM Leaderboard v2 (test) | BBH: 60.84 | 20 |
| Unified Multi-task Language Understanding and Instruction Following | Open LLM Leaderboard v1 (test) | MMLU-P Accuracy: 11.5 | 19 |
| General Language Understanding and Reasoning | Open LLM Leaderboard Lighteval (test) | Mean Accuracy: 91.07 | 17 |
| Language Modeling | Open LLM Leaderboard & General Ability Benchmarks (MMLU-P, GPQA, BBH, MATH, MuSR, IFEval, ARC, Hellaswag, PIQA, BoolQ, WinoGrande, COPA, OpenBookQA, SciQ) unified (test) | MMLU-P Accuracy: 12 | 16 |
| Large Language Model Evaluation | Open LLM Leaderboard v1 (test) | Average Score: 69.6 | 14 |
| Language Modeling Evaluation | Open LLM Leaderboard | ARC: 70.22 | 14 |
| Natural Language Understanding | Open LLM Leaderboard (test) | ARC: 57.94 | 13 |
| General Language Understanding | Open LLM Leaderboard HuggingFace 2023a (test) | ARC-c Accuracy (25-shot): 59.4 | 12 |
| General Language Evaluation | Open LLM Leaderboard 1 | Overall Score: 66.12 | 9 |
| General Language Understanding | Open LLM Leaderboard (test) | ARC: 62.03 | 9 |
| Reasoning and Language Understanding | Open LLM Leaderboard MMLU-PRO, IFEval, BBH, GPQA, MATH, GSM8K, ARC v0.4.0 (test) | MMLU-PRO: 28.38 | 7 |
| General Language Understanding | Open LLM leaderboard | Average Score: 65.51 | 7 |
| Downstream Language Understanding | Open LLM Leaderboard zero-shot | ARC-E: 52.9 | 6 |
| Unified Multi-task Language Understanding and Instruction Following | Open LLM Leaderboard | MMLU-P (Accuracy): - | 0 |
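Several entries above report an "Average Score" as the SOTA result. On the Open LLM Leaderboard this is conventionally the unweighted mean of the per-benchmark scores (for v1, the six benchmarks ARC, HellaSwag, MMLU, TruthfulQA, Winogrande, and GSM8K). A minimal sketch, with illustrative (not real) per-benchmark values for a hypothetical model:

```python
def average_score(scores: dict[str, float]) -> float:
    """Unweighted mean of per-benchmark scores on a 0-100 scale,
    as used for the leaderboard's 'Average Score' column."""
    return sum(scores.values()) / len(scores)

# Hypothetical per-benchmark results for one model (illustrative values only).
model_scores = {
    "ARC": 62.03,
    "HellaSwag": 84.20,
    "MMLU": 70.10,
    "TruthfulQA": 55.40,
    "Winogrande": 78.50,
    "GSM8K": 65.30,
}

print(f"Average Score: {average_score(model_scores):.2f}")
```

Note that the v2 leaderboard uses a different benchmark set (IFEval, BBH, MATH, GPQA, MuSR, MMLU-PRO) and normalizes scores before averaging, which is why v1 and v2 "Overall Score" values in the table are not directly comparable.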