
Overall

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Question Answering | Overall NQ, TriviaQA, BioASQ, PopQA | Accuracy 0.617 | 32 |
| Macro-average Reasoning | Overall NaturalPlan AIME 2024 GPQA | Final Score (Macro-Avg) 96.5 | 28 |
| Mathematical Reasoning | Overall GSM8K, MATH-500, AMC, AIME24, AIME25 | Accuracy 91.6 | 26 |
| Polyp Segmentation | Overall Combined 5 Datasets (test) | mDice 85.1 | 24 |
| Knowledge Graph Completion | Overall DB15K, MKG-W, MKG-Y | MRR 41.04 | 22 |
| Model Evaluation Summary | Overall Aggregate | Average Score 1.003 | 22 |
| Polyp Segmentation | Overall Combined Datasets | mDice 0.844 | 21 |
| Mathematical Reasoning | Overall Macro-average | Accuracy (%) 70.97 | 20 |
| Correctness Prediction | Overall Combined Datasets | Accuracy 70.12 | 18 |
| Emotion Reasoning | Overall (test) | Factual Alignment (FA) 3.54 | 17 |
| Question Answering | Overall | Accuracy 77.1 | 15 |
| Survival Prediction | Overall Across Cohorts | C-Index 0.629 | 15 |
| Visual Grounding | Overall | Accuracy 84.87 | 12 |
| AI-generated image detection | Overall In-the-wild Aggregate | Average Accuracy 91.8 | 11 |
| Summarization | Overall Multi-dataset Average | Completeness 48 | 11 |
| Mathematical Reasoning | Overall Combined Benchmarks | Avg@3 Score 58.4 | 10 |
| Retrieval | Overall (Average) | Recall@10 36.6 | 10 |
| Question Answering | Overall Average (test) | EM 58.3 | 10 |
| Adversarial Code Compliance | Overall Mean | Decoupling Probability 97.1 | 9 |
| Tool-Integrated Reasoning | Overall 9 Benchmarks | Average Score 88 | 9 |
| Retrieval | Overall (Musique, HotpotQA, NarrativeQA, DetectiveQA) | Avg Recall@3 56.64 | 8 |
| Aggregate Performance | Overall Across All Benchmarks | SUM 563.56 | 8 |
| Molecule Property Prediction | Overall | Top-1 Count 21 | 8 |
| Aggregated Logical Reasoning | Overall Mean | Accuracy 76.2 | 7 |
| Aggregated Logical Reasoning | Overall Unsolvable | Accuracy 0.945 | 7 |
Showing 25 of 40 rows