GSM-Hard

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Math reasoning | GSM-Hard (test) | Accuracy: 55.94 | 30 |
| Mathematical Reasoning | GSM-Hard OOD | Greedy Accuracy: 19 | 23 |
| Mathematical Reasoning | GSM-Hard OOD 1.0 (test) | Greedy Success Rate: 12 | 9 |
| Mathematical Reasoning | GSM-Hard Out-of-Distribution (test) | Final Answer Accuracy: 71 | 5 |