BigBench

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Commonsense Reasoning | BigBenchHard | Accuracy: 71.7 | 18 |
| Discriminative tasks | BigBench 13 tasks (val) | Accuracy: 58.7 | 17 |
| Audio-based Reasoning | BigBench Audio | Accuracy: 73.77 | 8 |
| Language Modeling and Reasoning | BigBench (Lamb, SQuAD, CoQA, BBH, LSAT, LangID) | Avg Score: 24 | 8 |
| Instruction Induction | BigBench Instruction Induction (BBII) (test) | BBII Text Classification Score: 60.14 | 6 |
| Reasoning | BigBenchHard | Accuracy (BigBenchHard): 82.4 | 5 |
| Multi-task Reasoning | BigBench Hard | Score: 31.1 | 5 |
| Streaming Voice-Agent Interaction Efficiency | BigBench Audio | NFE: 80.73 | 5 |
| Reasoning | BigBench Extra Hard | mean@4: 14.3 | 4 |
| Contextual Reasoning | BigBenchHard | EM: 63.23 | 4 |
| Reasoning and Language Understanding | BigBench Emergent Suite (BBES) | Navigate: 67 | 2 |
| LLM Performance Prediction | BigBench (val) | Metric: - | 0 |