
Average

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Conformal Inference | Average across 15 datasets (test) | Top-1 Accuracy: 81.1 | 60 |
| Out-of-Distribution Detection | Average OpenImage-O, Texture, iNaturalist, ImageNet-O | AUROC: 96.23 | 54 |
| Image Classification | Average 11 datasets | Base Accuracy: 87.55 | 52 |
| Hallucination Detection | Average Cross-domain | Mean AUROC: 0.7874 | 48 |
| Commonsense Reasoning | Average 7 Commonsense Reasoning Tasks | Avg Accuracy: 72.04 | 47 |
| Audio Modeling | Average Bach Counting Blues v1 (test) | SNR: 32.66 | 46 |
| Error Detection | Average All shifts (test) | AUC: 90.99 | 40 |
| Few-shot Semantic Segmentation | Average Deepglobe, ISIC, Chest X-Ray, FSS-1000 | mIoU: 76.9 | 32 |
| Cardiac Segmentation | Average (ACDC, M&Ms, RandBias, RandGhosting, RandMotion, RandSpike) (out-of-domain) | Dice Score: 81.25 | 32 |
| Multiple Choice Question Answering | Average (OBQA, ARC, Riddle, PQA) | Average Accuracy: 68.31 | 31 |
| Perceptual Image Restoration | Average across datasets (combined) | PSNR: 31.28 | 27 |
| Mathematical Reasoning | Average GSM8k-Aug, GSM-Hard, SVAMP, MultiArith | Accuracy: 79.3 | 26 |
| Image Deraining | Average across Deraining Datasets (test) | PSNR: 34.24 | 26 |
| Video Understanding | Average VideoMME, LongVideoBench, MLVU | Score: 55.9 | 25 |
| Mathematical Reasoning | Average Six Benchmarks | Accuracy: 68.66 | 24 |
| Calibration | Average StrategyQA, HotpotQA, NQ, Bamboogle | ECE: 0.264 | 24 |
| Multi-hop Question Answering | Average (2WikiMQA, Bamboogle, Frames, MuSiQue) | Accuracy: 54.8 | 24 |
| Instruction Following Evaluation | Average (Vicuna, Self-instruct, Dolly, BPO) (test) | Delta Win Rate (ΔWR): 22 | 24 |
| Aggregate NLP Tasks | Average (Emotion, Irony, Stance, MRPC, RTE) | Delta: 12.93 | 24 |
| Continual routing | Average | Accuracy: 75.2 | 22 |
| Image Deraining | Average Test100, Test1200, Rain100H, Rain100L (test) | PSNR: 34.59 | 21 |
| Video Reconstruction | Average | PSNR: 27.73 | 21 |
| Aggregated Performance | Average 10 Tasks | Average Accuracy: 100.6 | 19 |
| Question Answering | Average 10K context | Accuracy: 74.6 | 19 |
| OOD Detection | Average | FPR@95: 26.47 | 19 |
Showing 25 of 124 rows