
MultiArith

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Arithmetic Reasoning | MultiArith | Accuracy: 100 | 229 |
| Mathematical Reasoning | MultiArith | Accuracy: 100 | 143 |
| Arithmetic Reasoning | MultiArith (test) | Accuracy: 99.3 | 67 |
| Math Reasoning | MultiArith | Accuracy: 98.3 | 65 |
| Mathematical Reasoning | MultiArith | Original Accuracy: 99 | 40 |
| Math reasoning | MultiArith (test) | Accuracy: 99.59 | 30 |
| Mathematical reasoning | MultiArith Out of Distribution | Top-1 Accuracy (Maj@1): 100 | 30 |
| Group Collusive Attack Detection | MultiArith | Detection Accuracy: 92 | 27 |
| Question Answering | MultiArith | Accuracy: 74.3 | 24 |
| Mathematical Reasoning | MultiArith | Accuracy: 100 | 16 |
| Math Reasoning | MultiArith | Accuracy: 98.3 | 14 |
| Follow-up Questioning Consistency | MultiArith (unseen) | Average Success Count: 18.33 | 12 |
| Mathematical Reasoning | MultiArith | Accuracy: 43.33 | 10 |
| Mathematical Reasoning | MultiArith | Accuracy (Clean): 99.44 | 8 |
| Mathematical Reasoning | MultiArith (val) | Tokens at Best Step (K): 1,640 | 7 |
| Mathematical Reasoning | MultiArith OOD | Base Accuracy (CA): 100 | 2 |
| Arithmetic Reasoning | MultiArith | Accuracy (format-specific prompt): 78.7 | 2 |
Showing 17 of 17 rows