
Computational and Knowledge-Intensive Reasoning Tasks

Benchmarks

Task Name | Dataset Name | SOTA Result | Trend
Tool-integrated Reasoning | Computational and Knowledge-Intensive Reasoning Tasks (AIME24, AIME25, MATH500, GSM8K, MATH, WebWalker, HQA, 2Wiki., MuSiQ., Bamb.) latest (test) | AIME 24 Score: 34.2 | 30