SOTA Language Model Evaluation benchmarks and papers with code | Wizwand
Language Model Evaluation
Benchmarks
| Dataset Name | SOTA Method | Metric | Value | Results | Last Updated |
| --- | --- | --- | --- | --- | --- |
| BenchPress short-context (test) | Qwen3-8B | Accuracy | 68.84 | 131 | 22d ago |
| Winogrande, ARC-C, ARC-E, Lambada, PIQA, Hellaswag, MMLU, IFEval, and GSM8K-CoT (Mixed standard 10-shot prompt) | IF4 | Accuracy | 80.39 | 88 | 21d ago |
| Open LLM Leaderboard v2 (test) | Qwen3-8B | BBH | 60.84 | 47 | 22d ago |
| ArxivRollBench 2026a | moonshotai/kimi-k2.6 | Valid Accuracy | 70.8 | 42 | 8d ago |
| Qwen3-0.6B Evaluation Suite average | EDGERAZOR | Average Performance | 47.8 | 24 | 27d ago |
| Pooled tasks Table 5 Llama-3.1 3.3 (various) | Llama-3.3 70B Instruct | Pooled Accuracy Estimate (γ̂) | 57.15 | 21 | 3mo ago |
| AdaptEval | SCALENET (Layer-wise) | ROUGE-Lsum | 0.2733 | 14 | 3mo ago |
| DCLM Core | Composer (2Mb-M-3A) | DCLM Core Score | 49.3 | 12 | 16d ago |
| lm-eval-harness (test) | E2E | MMLU | 74.22 | 9 | 27d ago |
| Downstream Tasks Evaluation Suite Math, Code, Law, Know., Reason., MMLU | Math Dense Expert | Math Accuracy | 4.92 | 9 | 3mo ago |
| DATE-LM MMLU, GSM8K, BBH | Grad Sim | MMLU Accuracy | 62.07 | 7 | 2d ago |
| Quality, Factuality, and Safety Evaluation Suite (test) | Self-Improving Pretraining | Generation Quality Score | 86.3 | 7 | 3mo ago |
| NLP Evaluation Suite (WG, PIQA, BoolQ, ARC-C, ARC-E, OBQA, HS, SciQ, LM, RTE) | QK sharing | WG | 60.14 | 6 | 3mo ago |
| Open LeaderBoard v2 | AgentSociety | AS | 0.6547 | 5 | 7d ago |
| 1.3B LLM Leaderboard | QA+C4-85B | ARC | 36.4 | 5 | 3mo ago |
| MMLU-Pro | AgentSociety | AS | 77.54 | 1 | 7d ago |
Showing 16 of 16 rows