Downstream Tasks

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Zero-shot Evaluation | Downstream Tasks Zero-shot | Accuracy 76 | 278 |
| Zero-shot Evaluation | Downstream Tasks MMLU, PIQA, Arc-E, Arc-C, Wino, OpenQA | MMLU 77.62 | 218 |
| Zero-shot Classification | Downstream Tasks Zero-shot (BoolQ, HellaSwag, WinoGrande, ARC-e, ARC-c, PIQA, OBQA) | BoolQ Accuracy 81.65 | 87 |
| Zero-shot Question Answering and Commonsense Reasoning | Zero-shot Downstream Tasks (ARC, HellaSwag, WinoGrande, BoolQ, PiQA) | Accuracy (ARC-E) 86 | 45 |
| Downstream Task Accuracy via Paradigm Routing | Downstream Tasks (test) | Accuracy 73.4 | 36 |
| Mean Performance Evaluation | Downstream Tasks Summary | Average Accuracy 61.5 | 36 |
| Downstream Task | 11 Downstream Tasks Aggregate | Average Accuracy 64.6 | 32 |
| Zero-shot Evaluation | 6 zero-shot downstream tasks | Average Accuracy 80.05 | 31 |
| Zero-shot evaluation | Downstream Tasks PiQA ARC Hellaswag Winogrande BoolQ | PiQA Accuracy (Zero-shot) 84.4 | 30 |
| Zero-shot Learning | 7 Downstream Tasks Avg | Average Score 74.1 | 28 |
| Language Modeling Downstream Evaluation | Downstream tasks Average (test) | Average Score 58.02 | 24 |
| Diverse Language Understanding | 62 downstream tasks | Average Accuracy 67.5 | 18 |
| Zero-shot Evaluation | Downstream tasks average | Avg Zero-shot Accuracy 81 | 16 |
| Multiple Choice Question Answering | 6 downstream tasks (ARC-Challenge, ARC-Easy, HellaSwag, Winogrande, SciQ, PIQA) | ARC-Challenge Accuracy 43.6 | 12 |
| Utility Evaluation | Downstream Tasks | Average Accuracy 63.4 | 12 |
| Question Answering | Downstream Tasks (PiQA, ARC-E, ARC-C, HellaSwag, Winogrande, BoolQ, OBQA, SiQA) Zero-shot | PiQA Accuracy 80.63 | 10 |
| Zero-shot Task Evaluation | 9 Downstream Tasks Utility | Average Accuracy 54.7 | 10 |
| Reasoning | Downstream Tasks PiQA, Arc E, Arc C, HS, WG, BoolQ | PiQA Accuracy 80.5 | 9 |
| Zero-shot Reasoning and Question Answering | Downstream Tasks (ARC-C, ARC-E, H'SWAG, LAM'DA, OPENBKQA, PIQA, W'GRANDE) zero-shot | ARC-C Score 32.76 | 9 |
| Multiple Choice Evaluation | Downstream Tasks (ARC, HellaSwag, PIQA, Winogrande) Standard LLM Eval (val/test) | Average Accuracy 56.98 | 9 |
| Downstream Task Evaluation | 15 Downstream Tasks summary | Median EG2 | 7 |
| Class-conditioned generation | Average of 8 downstream tasks | FID 10.63 | 7 |
| Predictive Validity Verification | Downstream Tasks (KOLD, HateXplain, etc.) | Average Correlation 0.3156 | 6 |
| Downstream evaluation | 9 downstream tasks | Average Accuracy 62.6 | 6 |
| Zero-shot Downstream Task Evaluation | Downstream Tasks zero-shot (Arc-c, Arc-e, BoolQ, COPA, MMLU, OBQA, PIQA, RTE, Winogrande) | Arc-c 49.32 | 6 |

Showing 25 of 30 rows