
GSM8K, MATH, HumanEval, MBPP, FinanceBench, ConvFinQA, PubMedQA, and MedQA

Benchmarks

Task Name | Dataset Name | SOTA Result | Trend
Multi-domain Evaluation | GSM8K, MATH, HumanEval, MBPP, FinanceBench, ConvFinQA, PubMedQA, and MedQA USMLE | Math Accuracy: 30.65 | 24