BeyondAIME

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Mathematical Reasoning | BeyondAIME | avg@16: 61.7 | 23 |
| Mathematical Reasoning | BeyondAIME | Accuracy: 82.5 | 18 |
| Confidence Calibration | BeyondAIME (test) | SNR Gain: 1.202 | 15 |
| Reasoning | BeyondAIME | Pass@1: 70.38 | 14 |
| Mathematics | BeyondAIME | Avg@10: 66.56 | 9 |
| Mathematical Reasoning | BeyondAIME | Pass@1: 8.3 | 8 |
| Claim-level Confidence Calibration | BeyondAIME | SNR Gain: 0.301 | 7 |
| Tool-integrated Mathematical Reasoning | BeyondAIME | Pass@1: 41 | 6 |
| Mathematical Reasoning | BeyondAIME | Mean@10: 71.8 | 4 |
| Mathematical Reasoning | BeyondAIME | Pass@16: 27.84 | 3 |
| Mathematical Reasoning | BeyondAIME | pass@64: 31.3 | 3 |
| Mathematical Reasoning | BeyondAIME | Turn 1 Score: 2 | 2 |