
Reasoning Domain Suite

Benchmarks

Task Name | Dataset Name | SOTA Result | Trend
Tool-using Reasoning | Reasoning Domain Suite (AIME2024, AIME2025, HotpotQA, 2WikiMultihopQA, Musique) | Average Accuracy: 42.39 | 13