
GSM8K, Math, AIME, HumanEval, LiveCodeBench, ARC, MMLU, GPQA

Benchmarks

Task Name | Dataset Name | SOTA Result | Trend
Reasoning and Knowledge | GSM8K, Math, AIME, HumanEval, LiveCodeBench, ARC-C, ARC-E, MMLU, GPQA | GSM8K: 95.41
9