SOTA Language Understanding and Reasoning benchmarks and papers with code | Wizwand
Language Understanding and Reasoning
Benchmarks
| Dataset Name | SOTA Method | Metric | Score | Results | Last Updated |
| --- | --- | --- | --- | --- | --- |
| LLM Evaluation Suite MMLU, GSM8k, HellaSwag, WinoGrande | FP16 | MMLU Score | 64.43 | 31 | 3mo ago |
| MMLU Pro | AIMMerging | Overall Score | 67.8 | 20 | 22d ago |
| LLM Evaluation Suite ARC-e, ARC-c, HellaSwag, OBQA, WinoGrande, MathQA, PIQA | Dense | Average Accuracy | 64.01 | 19 | 3mo ago |
| General Capability Suite (MMLU, TruthfulQA, HellaSwag, ARC-Easy) (test) | Unlearn-Smooth | MMLU Score | 0.082 | 16 | 19d ago |
| MMLU, PIQA, HellaSwag, WinoGrande, ARC-Challenge | BF16 | MMLU (5s) | 46.45 | 13 | 3mo ago |
| LM-Evaluation-Harness ARC-c, ARC-e, BoolQ, HellaS., MMLU, OBQA, PIQA, WG | Dense | ARC-c Accuracy | 58.4 | 12 | 3mo ago |
| German NLP Benchmark Suite (MMLU, ARC-C, ARC-E, H-Swag, LAMBADA, OBQA) | Qwen3-1.7B-Base | MMLU | 34.17 | 10 | 1mo ago |
| MMLU, BoolQ, ARC-e, PIQA, Hellaswag, OBQA, Winogrande (test) | Pre-LN + LayerNorm Scaling | MMLU | 28.69 | 10 | 3mo ago |
| ARC-Challenge, ARC-Easy, BoolQ, HellaSwag (HS), OpenBookQA (OBQA), PIQA, Winogrande (WG), MMLU (test) | Original | ARC-Challenge Accuracy | 53.16 | 9 | 19d ago |
| Language Understanding Evaluation Suite (Arc-c, Arc-e, BoolQ, COPA, MMLU, OBQA, PIQA, RTE, Winogrande) Zyda2 calibration (test) | Moonlight | ARC-c | 58.28 | 6 | 3mo ago |
| OLMES Standard | OLMo-2-0425-1B | ARC-Easy Accuracy | 75.9 | 5 | 2mo ago |
| Combined (GSM8k, MATH500, MAWPS, SVAMP, AQuA, GLUE, CSQA, OBQA) | MoSLoRA | Average Score | 72.94 | 5 | 3mo ago |
| Downstream Task Suite (PIQA, ARC-e, HellaSwag, GPQA, Lambada, MMLU, BBH) | DAG-MoE-l | PIQA | 50.67 | 2 | 1d ago |
| Mixtral-8x7B Zero-shot Suite: Arc C, BoolQ, Lambada, PIQA, Winogrande v0.1-Instruct | FP16 | Arc C Accuracy | 65.7 | 2 | 2mo ago |