
AVG

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Question Answering | AVG. | EM 47.5 | 48 |
| Speculative Decoding | Avg. | Speedup 6.73 | 24 |
| Mathematical Reasoning | Avg. (held out) | Accuracy 62.8 | 24 |
| change detection | Avg across SYSU, LEVIR, GVLM, CLCD, OSCD | Precision 84.8 | 23 |
| General Reasoning | AVG Reasoning Suite | Accuracy 77.4 | 18 |
| Low-Light Image Enhancement | AVG. DICM, MEF, LIME, NPE, VV | NIQE 3.589 | 17 |
| Document Reranking | AVG | NDCG@5 54.373 | 14 |
| Reasoning Performance (Aggregate) | AVG | TPF 351 | 14 |
| Question Answering | AVG. Aggregate of NQ, TQA, HQA, 2WIKI (test) | EM 42.5 | 14 |
| Emotion Recognition in Conversation | AVG IEMOCAP, MELD, EmoryNLP | W-F1 60.91 | 11 |
| Selective classification | Avg (all) | AURC (10^-2 Scale) 0.215 | 11 |
| Selective classification | Avg 1K | AURC (Scale 10^-2) 0.248 | 11 |
| Geolocation | AVG (test) | City Acc (25km) 8.3 | 10 |
| Detection | AVG | AUC 0.897 | 10 |
| Multi-task Language Understanding | AVG Across All Benchmarks | Throughput 12.89 | 8 |
| Scene Text Recognition | AVG 12 benchmarks | Word Accuracy 91.33 | 8 |
| Preference Alignment | Avg. | GRA 70.6 | 4 |
| Machine Translation | Avg (News, Flores, Subtitle, Travel) German-English (test aggregate) | DA 87.77 | 4 |
Showing 18 of 18 rows