SOTA General Language Understanding and Reasoning benchmarks and papers with code | Wizwand
General Language Understanding and Reasoning
Benchmarks
| Dataset Name | SOTA Method | Metric | Score | Results | Last Updated |
|---|---|---|---|---|---|
| General Benchmarks MMLU, HellaSwag, OBQA, WinoGrande, ARC-C, PiQA, SciQ, LogiQA | RegMix | MMLU Accuracy | 35.68 | 70 | 1mo ago |
| Open LLM Leaderboard Population (Top-50) | calme-3.2-instruct-78b | Accuracy | 60.08 | 50 | 2mo ago |
| Huggingface Open LLM Leaderboard | SimPO | HellaSwag Accuracy | 85.32 | 30 | 8d ago |
| TRACE | MagMax | C-STANCE Accuracy | 59 | 29 | 3mo ago |
| Open LLM Leaderboard Lighteval (test) | GPT-5 | Mean Accuracy | 91.07 | 17 | 3mo ago |
| General domain benchmarks (test) | AM-Thinking (math) | DROP Score | 93.3 | 16 | 3mo ago |
| Leaderboard Benchmarks (IFEval, BBH, MATH, GPQA, MUSR, MMLU-PRO) | RMiPO | IFEval Score | 69.07 | 14 | 1mo ago |
| MMLU-Redux | Qwen 3 14B | Accuracy | 83.7 | 14 | 3mo ago |
| LLM Evaluation Suite (ARC, CSQA, GSM8K, HS, MMLU, OBQA, PIQA, SIQA, TQA, WG) | Muon (OSP) | ARC | 45.9 | 14 | 3mo ago |
| Academic Benchmarks (test) | Camelidae-8x34B-pro | Average Score | 59.9 | 10 | 3mo ago |
| 7-Task Evaluation Suite (HellaSwag, MathQA, MMLU, OpenBookQA, WinoGrande, GSM8K, HumanEval) | BITSMOE | Average Accuracy | 61.91 | 8 | 1d ago |
| General LLM Evaluation Suite ARC-C ARC-E BoolQ HellaSwag MMLU OBQA RTE WinoGrande | Full model | ARC-Challenge Accuracy | 62.5 | 5 | 1mo ago |
| OpenLLM Leaderboard BBH, GPQA, IFEVAL, MMLU, MUSR (test) | Qwen 2.5 72B | BBH | 72.7 | 4 | 3mo ago |
| MMLU, MMLU-Redux, MMLU-Pro, AGIEval, BBH, ARC-Easy, ARC-Challenge, BoolQ | - | - | - | 0 | 1mo ago |
Showing 14 of 14 rows