Average

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Image Classification | Average 11 datasets | Base Accuracy 87.55 | 83 |
| Commonsense Reasoning | Average 7 Commonsense Reasoning Tasks | Avg Accuracy 72.04 | 72 |
| Conformal Inference | Average across 15 datasets (test) | Top-1 Accuracy 81.1 | 60 |
| Few-shot Semantic Segmentation | Average Deepglobe, ISIC, Chest X-Ray, FSS-1000 | mIoU 76.9 | 54 |
| Out-of-Distribution Detection | Average OpenImage-O, Texture, iNaturalist, ImageNet-O | AUROC 96.23 | 54 |
| Hallucination Detection | Average Cross-domain | Mean AUROC 0.7874 | 48 |
| Few-shot Image Classification | Average 11 datasets (test) | Average Accuracy (Few-shot) 87.41 | 47 |
| Question Answering | Average of 5 datasets | Average Score 78.9 | 46 |
| Audio Modeling | Average Bach Counting Blues v1 (test) | SNR 32.66 | 46 |
| Error Detection | Average All shifts (test) | AUC 90.99 | 40 |
| Perceptual Image Restoration | Average across datasets (combined) | PSNR 37.6 | 35 |
| Model Selection | Average | Weighted Kendall's Tau (w) 0.47 | 32 |
| Cardiac Segmentation | Average (ACDC, M&Ms, RandBias, RandGhosting, RandMotion, RandSpike) (out-of-domain) | Dice Score 81.25 | 32 |
| Multiple Choice Question Answering | Average (OBQA, ARC, Riddle, PQA) | Average Accuracy 68.31 | 31 |
| Semantic Segmentation | Average Overall | mIoU 51.9 | 28 |
| Reasoning | Average (MATH, OlympicBench, MinervaMath, AMC2023, GPQA) (test) | Pass@1 59.83 | 27 |
| Multi-hop Question Answering | Average MuSiQue, 2wiki, HotpotQA | F1 Score 63.02 | 26 |
| Mathematical Reasoning | Average GSM8k-Aug, GSM-Hard, SVAMP, MultiArith | Accuracy 79.3 | 26 |
| Image Deraining | Average across Deraining Datasets (test) | PSNR 34.24 | 26 |
| Video Understanding | Average VideoMME, LongVideoBench, MLVU | Score 55.9 | 25 |
| Question Answering | Average (TriviaQA, HotpotQA, 2Wiki, Musique, Bamboogle) | EM Accuracy 49.12 | 24 |
| Mathematical Reasoning | Average Six Benchmarks | Accuracy 68.66 | 24 |
| Classification | Average | Base Score 87.41 | 24 |
| Calibration | Average StrategyQA, HotpotQA, NQ, Bamboogle | ECE 0.264 | 24 |
| Multi-hop Question Answering | Average (2WikiMQA, Bamboogle, Frames, MuSiQue) | Accuracy 54.8 | 24 |
Showing 25 of 200 rows
...