
BigBench

Benchmarks

| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Reasoning | BigBenchHard | Accuracy (BigBenchHard) | 100 | 22 |
| General Reasoning | BigBench-Lite Topic Domains | BBL Score | 66.8 | 18 |
| Commonsense Reasoning | BigBenchHard | Accuracy | 71.7 | 18 |
| Discriminative tasks | BigBench 13 tasks (val) | Accuracy | 58.7 | 17 |
| Question Answering | BIGBENCH II | True WS Score | 100 | 12 |
| Generative Classification | BigBench (test) | Accuracy | 76.6 | 10 |
| Natural Language Processing | BigBench II | Accuracy Degradation (%) | -0.37 | 9 |
| Audio-based Reasoning | BigBench Audio | Accuracy | 73.77 | 8 |
| Language Modeling and Reasoning | BigBench (Lamb, SQuAD, CoQA, BBH, LSAT, LangID) | Avg Score | 24 | 8 |
| Instruction Induction | BigBench Instruction Induction (BBII) (test) | BBII Text Classification Score | 60.14 | 6 |
| Linguistic Reasoning | BigBench Hard Hyperbaton | Accuracy | 80.2 | 5 |
| Linguistic Reasoning | BigBench Hard Snarks | Accuracy | 0.554 | 5 |
| Logical Reasoning | BigBench Hard Formal Fallacies | Accuracy | 58.2 | 5 |
| Multi-task Reasoning | BigBench Hard | Score | 31.1 | 5 |
| Streaming Voice-Agent Interaction Efficiency | BigBench Audio | NFE | 80.73 | 5 |
| Reasoning | BigBench Extra Hard | mean@4 | 14.3 | 4 |
| Contextual Reasoning | BigBenchHard | EM | 63.23 | 4 |
| Reasoning | BigBench-H SEM variant | Accuracy | 95.08 | 2 |
| Reasoning and Language Understanding | BigBench Emergent Suite (BBES) | Navigate | 67 | 2 |
| LLM Performance Prediction | BigBench (val) | Metric | - | 0 |