
Open LLM Leaderboard

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Leaderboard Evaluation | Open LLM Leaderboard 2 | Overall Score: 51.28 | 55 |
| General Language Understanding and Reasoning | Open LLM Leaderboard Population (Top-50) | Accuracy: 60.08 | 50 |
| Language Model Evaluation | Open LLM Leaderboard v2 (test) | BBH: 60.84 | 47 |
| Leaderboard Evaluation | Open LLM Leaderboard 1 | Overall Score: 69.28 | 46 |
| Large Language Model Evaluation | Open LLM Leaderboard | Average Score: 74.2 | 41 |
| Large Language Model Evaluation | Open LLM Leaderboard v1 (test) | Average Score: 69.6 | 34 |
| Language Modeling and Reasoning | Open LLM Leaderboard | ARC: 82.8 | 33 |
| Open-style response generation | Open-LLM Leaderboard | Accuracy: 53.45 | 28 |
| Performance Estimation | Open LLM Leaderboard subset-selection | MAE: 1.4 | 24 |
| General LLM Evaluation | Open LLM Leaderboard (test) | Average Score: 70.1 | 21 |
| Unified Multi-task Language Understanding and Instruction Following | Open LLM Leaderboard v1 (test) | MMLU-P Accuracy: 11.5 | 19 |
| General Language Understanding and Reasoning | Open LLM Leaderboard Lighteval (test) | Mean Accuracy: 91.07 | 17 |
| Language Modeling | Open LLM Leaderboard & General Ability Benchmarks (MMLU-P, GPQA, BBH, MATH, MuSR, IFEval, ARC, Hellaswag, PIQA, BoolQ, WinoGrande, COPA, OpenBookQA, SciQ) unified (test) | MMLU-P Accuracy: 12 | 16 |
| General Language Capabilities | Open LLM Leaderboard lm-eval-harness (test) | HellaSwag Accuracy: 83.29 | 14 |
| Language Modeling Evaluation | Open LLM Leaderboard | ARC: 70.22 | 14 |
| Natural Language Understanding | Open LLM Leaderboard (test) | ARC: 57.94 | 13 |
| General Language Understanding | Open LLM Leaderboard HuggingFace 2023a (test) | ARC-c Accuracy (25-shot): 59.4 | 12 |
| Ranking quality gain estimation | Open LLM Leaderboard v2 | Ranking Quality Gain: 21.04 | 9 |
| General Language Evaluation | Open LLM Leaderboard 1 | Overall Score: 66.12 | 9 |
| General Language Understanding | Open LLM Leaderboard (test) | ARC: 62.03 | 9 |
| Reasoning and Language Understanding | Open LLM Leaderboard MMLU-PRO, IFEval, BBH, GPQA, MATH, GSM8K, ARC v0.4.0 (test) | MMLU-PRO: 28.38 | 7 |
| General Language Understanding | Open LLM leaderboard | Average Score: 65.51 | 7 |
| Downstream Language Understanding | Open LLM Leaderboard zero-shot | ARCE: 52.9 | 6 |
| Unified Multi-task Language Understanding and Instruction Following | Open LLM Leaderboard | MMLU-P (Accuracy): - | 0 |