SOTA Large Language Model Evaluation benchmarks and papers with code | Wizwand
Large Language Model Evaluation
Benchmarks
| Dataset Name | SOTA Method | Metric | Value | Results | Last Updated |
| --- | --- | --- | --- | --- | --- |
| MMLU, GSM8k, HellaSwag, WinoGrande | FP16 | Average Score | 78.9 | 58 | 3mo ago |
| 10 tasks average | DeltaLoss-only | Avg Accuracy | 70.56 | 50 | 3mo ago |
| HuggingFace Open LLM Leaderboard | SynPO | GSM8K | 55.37 | 49 | 2mo ago |
| Open PL LLM Leaderboard instruction-tuned | Mistral-Large-Instruct-2411 | Overall Average Score | 69.84 | 44 | 3mo ago |
| Open LLM Leaderboard | SOLAR-10.7B-Instruct-v1.0 | Average Score | 74.2 | 41 | 1mo ago |
| Qwen3-0.6B Average (test) | EDGERAZOR | Average Performance | 47.83 | 38 | 27d ago |
| HuggingFace Open LLM Leaderboard lm-eval-harness default (various) | Teacher | HellaSwag | 84.34 | 36 | 27d ago |
| Open LLM Leaderboard v1 (test) | FP16 | Average Score | 69.6 | 34 | 21d ago |
| MMLU, GSM8K, GPQA, HUMANEVAL, TRUTHFULQA, IFEVAL | GRPO | MMLU | 70.7 | 23 | 3mo ago |
| ARC, TruthfulQA, Winogrande, GSM8K, HellaSwag, MMLU | DNPO | ARC Accuracy | 73.7 | 16 | 2mo ago |
| 12-task evaluation suite composite (test) | FineWeb-Edu | Reading Comprehension Score | 49.6 | 14 | 3mo ago |
| Qwen-32B | FP16 | MMLU | 80.81 | 13 | 3mo ago |
| MMLU, GSM8k, HellaSwag, WinoGrande (test) | FP16 | MMLU Accuracy | 86.55 | 13 | 3mo ago |
| OpenCompass | Qwen3-30B-A3B | cMMLU | 84.88 | 11 | 3mo ago |
| Slovene-LLM-Eval | GaMS-27B-Nemotron | Average Rank | 3.05 | 10 | 3mo ago |
| GSM8K, TruthfulQA, CommonsenseQA, MMLU, ARC, and TriviaQA (various) | JoBS | Accuracy | 88 | 9 | 3mo ago |
| NorEval (test) | NorwAI-Mistral-7B | Overall Score | 0.455 | 8 | 3mo ago |
| SFT Evaluation Suite (AlpacaEval, TruthfulQA, MMLU) (test) | Warmup-Stable-Only (WSO) | AlpacaEval Score | 78.1 | 7 | 2mo ago |
| LLaMA 3B 3.2 | baseline | PPL | 7.81 | 6 | 2mo ago |
| LLaMA 1B 3.2 | baseline | Perplexity (PPL) | 9.75 | 6 | 2mo ago |
| LLaMA-3 8B | baseline | PPL | 6.13 | 6 | 2mo ago |
| LLaMA-2 13B | baseline | Perplexity | 4.88 | 6 | 2mo ago |
| LLaMA-2 7B | baseline | PPL | 5.47 | 6 | 2mo ago |
| MT-Bench benign prompts | No defense | Average Time Cost | 41.56 | 6 | 3mo ago |
| Knowledge Specialized Target (test) | CAMEL | Weighted Average Score | 56.5 | 4 | 2mo ago |
Showing 25 of 30 rows