SOTA General Language Understanding benchmarks and papers with code | Wizwand
General Language Understanding
Benchmarks
| Dataset Name | SOTA Method | Metric | Best Result | Results | Last Updated |
|---|---|---|---|---|---|
| tinyBenchmark | No Steering | Accuracy (ARC) | 77.51 | 81 | 19d ago |
| GLUE | Full FT | Accuracy | 92.5 | 66 | 1mo ago |
| GLUE v1 (test dev) | BOMF (Ours) | MNLI | 87.86 | 40 | 1mo ago |
| Standard Downstream Tasks Suite (SciQ, PIQA, WinoGrande, ARC-E, ARC-C, HellaSwag, LogiQA, BoolQ, LAMBADA, MMLU) | ConceptLM | Average Accuracy | 48.3 | 32 | 1mo ago |
| MMLU | Origin | MMLU Score | 73.59 | 28 | 26d ago |
| General LLM Benchmarks (ARC-C, CSQA, HellaSwag, LAMBADA, MMLU, OpenBookQA, PIQA, Winogrande) (test) | Original | ARC-C Accuracy | 59.5 | 22 | 1mo ago |
| 12-task evaluation suite (test) | Efficient-DLM 8B | Average Score | 71.62 | 20 | 1mo ago |
| C-Eval (val) | Qwen-1.5 14B (Teacher) | Accuracy | 78.68 | 18 | 1mo ago |
| General Ability Suite (C-QA, T-QA, LAM, MMLU, L-Code) | Base | Average Score | 48.1 | 16 | 1mo ago |
| 8 Sub-Tasks (test) | LoRA | Performance on 8 Sub-Tasks | 62.3 | 14 | 1mo ago |
| NLP Evaluation Suite (SciQ, PIQA, WG, ARC, HellaSwag, LogiQA, BoolQ, LAMBADA) | MOUE L48 | SciQ Accuracy | 58.3 | 14 | 1mo ago |
| CMMLU | NBDiff-7B-BASE | Overall Accuracy | 77.3 | 14 | 1mo ago |
| All tasks (25 tasks) (val) | polish-roberta-8k | Overall Accuracy | 85.93 | 13 | 1mo ago |
| General Language Tasks Suite (WikiText-2, MMLU, PIQA, HellaSwag, WinoGrande, ARC-Challenge) standard (various) | BF16 | PPL | 4.88 | 13 | 1mo ago |
| MMLU | Qwen2.5-14B | Accuracy | 71.6 | 12 | 8d ago |
| Open LLM Leaderboard HuggingFace 2023a (test) | LLaMA-2-13B (Zero-shot) | ARC-c Accuracy (25-shot) | 59.4 | 12 | 1mo ago |
| BIG-bench Mimicked | Llama 3-8b | Sports Score | 99.7 | 11 | 1mo ago |
| BIG-bench Original | Claude 2.1 | Sports Score | 99.4 | 11 | 1mo ago |
| P3 v1 (unseen) | T0-11B | RTE Accuracy | 80.83 | 11 | 1mo ago |
| MMLU, AGIEval | TSV-M + OrthoMerge-G | MMLU Score | 0.5592 | 10 | 1mo ago |
| Open LLM Leaderboard (test) | Falcon Mamba-7B | ARC | 62.03 | 9 | 1mo ago |
| General Downstream Tasks Aggregate | PonderLM-2-Pythia-1.4B | Average Accuracy | 59.5 | 8 | 1mo ago |
| Composite Evaluation Suite C3 to CMMLU | Dense | Avg Accuracy | 0.492 | 8 | 1mo ago |
| Russian SuperGLUE (test) | Human Performance | LiDiRus (Corr) | 62.6 | 8 | 1mo ago |
| Open LLM leaderboard | HyPO-Llama-3 | Average Score | 65.51 | 7 | 1mo ago |
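Several rows above report an "Average Accuracy" or "Average Score" over a suite of tasks. This is conventionally a macro-average: each task's accuracy is computed separately, then the per-task scores are averaged with equal weight. A minimal sketch, using made-up task scores (not values from this leaderboard):

```python
# Macro-average over per-task accuracies, as typically reported for
# multi-task suites. Task names echo the suites above; the scores are
# illustrative placeholders, not leaderboard results.
task_accuracy = {
    "SciQ": 0.93,
    "PIQA": 0.79,
    "WinoGrande": 0.70,
    "ARC-E": 0.75,
    "HellaSwag": 0.60,
}

# Each task contributes equally, regardless of its number of examples.
average = sum(task_accuracy.values()) / len(task_accuracy)
print(f"Average accuracy over {len(task_accuracy)} tasks: {average:.3f}")
```

Note that a macro-average weights a small task (e.g. a few hundred examples) the same as a large one, so it is not the same as pooling all examples and computing a single accuracy.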
Showing 25 of 44 rows
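One row above reports PPL (perplexity) rather than accuracy; lower is better. Perplexity is the exponential of the mean per-token negative log-likelihood under the model. A minimal sketch with made-up per-token losses (the values are illustrative, not drawn from this leaderboard):

```python
import math

# Perplexity = exp(mean per-token cross-entropy, in nats).
# The per-token losses below are placeholder values for illustration.
token_nll = [1.52, 1.61, 1.48, 1.70, 1.59]

mean_nll = sum(token_nll) / len(token_nll)
ppl = math.exp(mean_nll)
print(f"PPL: {ppl:.2f}")
```

Because perplexity is an exponential of an average loss, small differences in PPL near 5 correspond to only a few hundredths of a nat in cross-entropy.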