
Average

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Image Classification | Average 11 datasets | Base Accuracy: 87.55 | 95 |
| Commonsense Reasoning | Average 7 Commonsense Reasoning Tasks | Avg Accuracy: 72.04 | 72 |
| Conformal Inference | Average across 15 datasets (test) | Top-1 Accuracy: 81.1 | 60 |
| Few-shot Semantic Segmentation | Average Deepglobe, ISIC, Chest X-Ray, FSS-1000 | mIoU: 76.9 | 54 |
| Out-of-Distribution Detection | Average OpenImage-O, Texture, iNaturalist, ImageNet-O | AUROC: 96.23 | 54 |
| Multi-hop Question Answering | Average (MuSiQue, HotpotQA, 2WikiMultiHopQA, LongSeal) | Average QA Score: 46.49 | 50 |
| Hallucination Detection | Average Cross-domain | Mean AUROC: 0.7874 | 48 |
| Few-shot Image Classification | Average 11 datasets (test) | Average Accuracy (Few-shot): 87.41 | 47 |
| Semantic Segmentation | Average Overall | mIoU: 64.3 | 46 |
| Question Answering | Average of 5 datasets | Average Score: 78.9 | 46 |
| Audio Modeling | Average Bach Counting Blues v1 (test) | SNR: 32.66 | 46 |
| Error Detection | Average All shifts (test) | AUC: 90.99 | 40 |
| Math and Science Reasoning | Average | Accuracy: 89.7 | 36 |
| SOC prediction | Average (FTP-75 and PDMHC) | MAE: 0.4118 | 35 |
| Perceptual Image Restoration | Average across datasets (combined) | PSNR: 37.6 | 35 |
| Continual Category Discovery | Average fine-grained | cACC (All): 81.49 | 32 |
| Model Selection | Average | Weighted Kendall's Tau (w): 0.47 | 32 |
| Cardiac Segmentation | Average (ACDC, M&Ms, RandBias, RandGhosting, RandMotion, RandSpike) (out-of-domain) | Dice Score: 81.25 | 32 |
| Multiple Choice Question Answering | Average (OBQA, ARC, Riddle, PQA) | Average Accuracy: 68.31 | 31 |
| Mathematical Reasoning | Average GSM8k-Aug, GSM-Hard, SVAMP, MultiArith | Avg Length: 0 | 31 |
| OOD Detection | Average | FPR@95: 26.47 | 31 |
| General Evaluation | Average Across all benchmarks | Speedup: 6.9 | 28 |
| Mathematical Reasoning | Average GSM8k, GSM-sym, MATH, NuminaMath, AIME | Accuracy: 71.41 | 28 |
| Reasoning | Average (MATH, OlympicBench, MinervaMath, AMC2023, GPQA) (test) | Pass@1: 59.83 | 27 |
| Multi-hop Question Answering | Average MuSiQue, 2wiki, HotpotQA | EM: 43.3 | 27 |
Showing 25 of 263 rows
...