
Overall

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Mathematical Reasoning | Overall | Accuracy 89.6 | 81 |
| General Reasoning | Overall | Accuracy 84.8 | 40 |
| Question Answering | Overall NQ, TriviaQA, BioASQ, PopQA | Accuracy 0.617 | 32 |
| Reasoning | Overall Combined Benchmarks | Accuracy 88.7 | 31 |
| Macro-average Reasoning | Overall NaturalPlan AIME 2024 GPQA | Final Score (Macro-Avg) 96.5 | 28 |
| Classification | Overall 13 datasets aggregate | N-Mean 85.7 | 26 |
| Mathematical Reasoning | Overall GSM8K, MATH-500, AMC, AIME24, AIME25 | Accuracy 91.6 | 26 |
| Math Reasoning | Overall Across five math reasoning datasets | Overall Accuracy 45.8 | 24 |
| General Reasoning | Overall | Accuracy 93.51 | 24 |
| General Reasoning | Overall MATH-500 AIME25 HumanEval GPQA | Accuracy 85.1 | 24 |
| Polyp Segmentation | Overall Combined 5 Datasets (test) | mDice 85.1 | 24 |
| Knowledge Graph Completion | Overall DB15K, MKG-W, MKG-Y | MRR 41.04 | 22 |
| Model Evaluation Summary | Overall Aggregate | Average Score 1.003 | 22 |
| General performance assessment | Overall Combined Benchmarks | Performance (Seen Data) 49.64 | 21 |
| Reasoning | Overall AMC23, AIME24, MATH500, GPQA-D aggregate | Accuracy 79.1 | 21 |
| Polyp Segmentation | Overall Combined Datasets | mDice 0.844 | 21 |
| Commonsense and Logical Reasoning | Overall CSQA, StrategyQA, LogiQA | Accuracy 64.95 | 20 |
| Mathematical Reasoning | Overall Macro-average | Accuracy (%) 70.97 | 20 |
| General Performance | Overall | Overall Score 62.05 | 19 |
| Visual Grounding | Overall | Accuracy 84.87 | 19 |
| Multimodal Continual Learning | Overall 20 Chunks | MAP 61.22 | 18 |
| Multimodal Continual Learning | Overall 15 Chunks | MAP 59.99 | 18 |
| Mathematical Reasoning | Overall Aggregated | Pass@1 52.3 | 18 |
| Correctness Prediction | Overall Combined Datasets | Accuracy 70.12 | 18 |
| Combinatorial Optimization | Overall (test) | Average Performance 73.01 | 17 |
Showing 25 of 84 rows