SOTA Large Language Model Evaluation benchmarks and papers with code | Wizwand
Large Language Model Evaluation
Benchmarks
| Dataset Name | SOTA Method | Metric | Value | Results | Last Updated |
|---|---|---|---|---|---|
| MMLU, GSM8k, HellaSwag, WinoGrande | FP16 | Average Score | 78.9 | 58 | 1mo ago |
| 10 tasks average | DeltaLoss-only | Avg Accuracy | 70.56 | 50 | 1mo ago |
| HuggingFace Open LLM Leaderboard | SynPO | GSM8K | 55.37 | 49 | 25d ago |
| Open PL LLM Leaderboard instruction-tuned | Mistral-Large-Instruct-2411 | Overall Average Score | 69.84 | 44 | 1mo ago |
| Open LLM Leaderboard | SOLAR-10.7B-Instruct-v1.0 | Average Score | 74.2 | 41 | 4d ago |
| MMLU, GSM8K, GPQA, HUMANEVAL, TRUTHFULQA, IFEVAL | GRPO | MMLU | 70.7 | 23 | 1mo ago |
| HuggingFace Open LLM Leaderboard lm-eval-harness default (various) | Teacher | HellaSwag | 84.34 | 18 | 1mo ago |
| ARC, TruthfulQA, Winogrande, GSM8K, HellaSwag, MMLU | DNPO | ARC Accuracy | 73.7 | 16 | 1mo ago |
| 12-task evaluation suite composite (test) | FineWeb-Edu | Reading Comprehension Score | 49.6 | 14 | 1mo ago |
| Open LLM Leaderboard v1 (test) | FP16 | Average Score | 69.6 | 14 | 1mo ago |
| Qwen-32B | FP16 | MMLU | 80.81 | 13 | 1mo ago |
| MMLU, GSM8k, HellaSwag, WinoGrande (test) | FP16 | MMLU Accuracy | 86.55 | 13 | 1mo ago |
| OpenCompass | Qwen3-30B-A3B | cMMLU | 84.88 | 11 | 1mo ago |
| Slovene-LLM-Eval | GaMS-27B-Nemotron | Average Rank | 3.05 | 10 | 1mo ago |
| GSM8K, TruthfulQA, CommonsenseQA, MMLU, ARC, and TriviaQA (various) | JoBS | Accuracy | 88 | 9 | 1mo ago |
| NorEval (test) | NorwAI-Mistral-7B | Overall Score | 0.455 | 8 | 1mo ago |
| SFT Evaluation Suite (AlpacaEval, TruthfulQA, MMLU) (test) | Warmup-Stable-Only (WSO) | AlpacaEval Score | 78.1 | 7 | 1mo ago |
| LLaMA 3B 3.2 | baseline | PPL | 7.81 | 6 | 1mo ago |
| LLaMA 1B 3.2 | baseline | Perplexity (PPL) | 9.75 | 6 | 1mo ago |
| LLaMA-3 8B | baseline | PPL | 6.13 | 6 | 1mo ago |
| LLaMA-2 13B | baseline | Perplexity | 4.88 | 6 | 1mo ago |
| LLaMA-2 7B | baseline | PPL | 5.47 | 6 | 1mo ago |
| MT-Bench benign prompts | No defense | Average Time Cost | 41.56 | 6 | 1mo ago |
| Knowledge Specialized Target (test) | CAMEL | Weighted Average Score | 56.5 | 4 | 1mo ago |
| Code Specialized Target (test) | CAMEL | Weighted Average Score | 52.8 | 4 | 1mo ago |
Showing 25 of 26 rows