Overall

Benchmarks

Task Name	Dataset Name	SOTA Result
Mathematical Reasoning	Overall	Accuracy 89.6	81
General Reasoning	Overall	Accuracy 84.8	40
Question Answering	Overall NQ, TriviaQA, BioASQ, PopQA	Accuracy 0.617	32
Reasoning	Overall Combined Benchmarks	Accuracy 88.7	31
Macro-average Reasoning	Overall NaturalPlan AIME 2024 GPQA	Final Score (Macro-Avg) 96.5	28
Classification	Overall 13 datasets aggregate	N-Mean 85.7	26
Mathematical Reasoning	Overall GSM8K, MATH-500, AMC, AIME24, AIME25	Accuracy 91.6	26
Math Reasoning	Overall Across five math reasoning datasets	Overall Accuracy 45.8	24
General Reasoning	Overall	Accuracy 93.51	24
General Reasoning	Overall MATH-500 AIME25 HumanEval GPQA	Accuracy 85.1	24
Polyp Segmentation	Overall Combined 5 Datasets (test)	mDice 85.1	24
Knowledge Graph Completion	Overall DB15K, MKG-W, MKG-Y	MRR 41.04	22
Model Evaluation Summary	Overall Aggregate	Average Score 1.003	22
General performance assessment	Overall Combined Benchmarks	Performance (Seen Data) 49.64	21
Reasoning	Overall AMC23, AIME24, MATH500, GPQA-D aggregate	Accuracy 79.1	21
Polyp Segmentation	Overall Combined Datasets	mDice 0.844	21
Commonsense and Logical Reasoning	Overall CSQA, StrategyQA, LogiQA	Accuracy 64.95	20
Mathematical Reasoning	Overall Macro-average	Accuracy (%) 70.97	20
General Performance	Overall	Overall Score 62.05	19
Visual Grounding	Overall	Accuracy 84.87	19
Multimodal Continual Learning	Overall 20 Chunks	MAP 61.22	18
Multimodal Continual Learning	Overall 15 Chunks	MAP 59.99	18
Mathematical Reasoning	Overall Aggregated	Pass@1 52.3	18
Correctness Prediction	Overall Combined Datasets	Accuracy 70.12	18
Combinatorial Optimization	Overall (test)	Average Performance 73.01	17

Showing 25 of 84 rows