Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| General Reasoning | Overall | Accuracy 84.8 | 40 |
| Mathematical Reasoning | Overall | Accuracy 80.17 | 36 |
| Question Answering | Overall (NQ, TriviaQA, BioASQ, PopQA) | Accuracy 0.617 | 32 |
| Macro-average Reasoning | Overall (NaturalPlan, AIME 2024, GPQA) | Final Score (Macro-Avg) 96.5 | 28 |
| Classification | Overall (13 datasets aggregate) | N-Mean 85.7 | 26 |
| Mathematical Reasoning | Overall (GSM8K, MATH-500, AMC, AIME24, AIME25) | Accuracy 91.6 | 26 |
| General Reasoning | Overall (MATH-500, AIME25, HumanEval, GPQA) | Accuracy 85.1 | 24 |
| Polyp Segmentation | Overall (Combined 5 Datasets, test) | mDice 85.1 | 24 |
| Knowledge Graph Completion | Overall (DB15K, MKG-W, MKG-Y) | MRR 41.04 | 22 |
| Model Evaluation Summary | Overall (Aggregate) | Average Score 1.003 | 22 |
| Reasoning | Overall (AMC23, AIME24, MATH500, GPQA-D aggregate) | Accuracy 79.1 | 21 |
| Polyp Segmentation | Overall (Combined Datasets) | mDice 0.844 | 21 |
| Mathematical Reasoning | Overall (Macro-average) | Accuracy (%) 70.97 | 20 |
| General Performance | Overall | Overall Score 62.05 | 19 |
| Visual Grounding | Overall | Accuracy 84.87 | 19 |
| Correctness Prediction | Overall (Combined Datasets) | Accuracy 70.12 | 18 |
| Emotion Reasoning | Overall (test) | Factual Alignment (FA) 3.54 | 17 |
| Question Answering | Overall | Accuracy 77.1 | 15 |
| Survival Prediction | Overall (Across Cohorts) | C-Index 0.629 | 15 |
| Reward Modeling | Overall (5-Benchmark Suite) | Average Score 73.5 | 12 |
| Question Answering | Overall | EM 41.6 | 11 |
| AI-generated image detection | Overall (In-the-wild Aggregate) | Average Accuracy 91.8 | 11 |
| Summarization | Overall (Multi-dataset Average) | Completeness 48 | 11 |
| Satellite-to-Ground Retrieval | Overall | Recall@1 53.5 | 10 |
| Ground-to-Satellite Retrieval | Overall | Recall@1 44.6 | 10 |
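For context on the retrieval rows, here is a minimal sketch of how two of the metrics reported in the table, MRR and Recall@1, are conventionally computed from per-query ranks of the first correct result. The `ranks` list below is hypothetical and not taken from any benchmark above.

```python
def mrr(ranks):
    """Mean Reciprocal Rank: average of 1/rank of the first correct hit."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def recall_at_k(ranks, k=1):
    """Fraction of queries whose first correct hit lands in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

# Hypothetical 1-based ranks of the correct item for five queries.
ranks = [1, 3, 2, 1, 10]
print(round(mrr(ranks) * 100, 2))         # MRR as a percentage -> 58.67
print(round(recall_at_k(ranks, 1) * 100, 1))  # Recall@1 as a percentage -> 40.0
```

Leaderboards typically report both as percentages, which is how the Recall@1 values of 53.5 and 44.6 above should be read.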
(Showing 25 of 65 rows.)