Average

Benchmarks

Task Name	Dataset Name	Metric	SOTA Result
Image Classification	Average 11 datasets	Base Accuracy	87.55	95
Commonsense Reasoning	Average 7 Commonsense Reasoning Tasks	Avg Accuracy	72.04	72
Conformal Inference	Average across 15 datasets (test)	Top-1 Accuracy	81.1	60
Few-shot Semantic Segmentation	Average Deepglobe, ISIC, Chest X-Ray, FSS-1000	mIoU	76.9	57
Out-of-Distribution Detection	Average OpenImage-O, Texture, iNaturalist, ImageNet-O	AUROC	96.23	54
Multi-hop Question Answering	Average (MuSiQue, HotpotQA, 2WikiMultiHopQA, LongSeal)	Average QA Score	46.49	50
Hallucination Detection	Average Cross-domain	Mean AUROC	0.7874	48
Few-shot Image Classification	Average 11 datasets (test)	Average Accuracy (Few-shot)	87.41	47
Semantic Segmentation	Average Overall	mIoU	64.3	46
Question Answering	Average of 5 datasets	Average Score	78.9	46
Audio Modeling	Average Bach Counting Blues v1 (test)	SNR	32.66	46
Mathematical Reasoning	Average GSM8k-Aug, GSM-Hard, SVAMP, MultiArith	Accuracy	79.3	42
Error Detection	Average All shifts (test)	AUC	90.99	40
Math and Science Reasoning	Average	Accuracy	89.7	36
SOC prediction	Average (FTP-75 and PDMHC)	MAE	0.4118	35
Perceptual Image Restoration	Average across datasets (combined)	PSNR	37.6	35
Base-to-novel generalization	Average base-to-novel	Base Score	85.55	33
Continual Category Discovery	Average fine-grained	cACC (All)	81.49	32
Model Selection	Average	Weighted Kendall's Tau (w)	0.47	32
Cardiac Segmentation	Average (ACDC, M&Ms, RandBias, RandGhosting, RandMotion, RandSpike) (out-of-domain)	Dice Score	81.25	32
Multiple Choice Question Answering	Average (OBQA, ARC, Riddle, PQA)	Average Accuracy	68.31	31
OOD Detection	Average	FPR@95	26.47	31
General Evaluation	Average Across all benchmarks	Speedup	6.9	28
Mathematical Reasoning	Average GSM8k, GSM-sym, MATH, NuminaMath, AIME	Accuracy	71.41	28
Reasoning	Average (MATH, OlympicBench, MinervaMath, AMC2023, GPQA) (test)	Pass@1	59.83	27

Showing 25 of 298 rows

...