BeyondAIME

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Confidence Calibration | BeyondAIME (test) | SNR Gain: 1.202 | 15 |
| Reasoning | BeyondAIME | Pass@1: 70.38 | 14 |
| Mathematics | BeyondAIME | Avg@10: 66.56 | 9 |
| Mathematical Reasoning | BeyondAIME | avg@16: 61.7 | 8 |
| Claim-level Confidence Calibration | BeyondAIME | SNR Gain: 0.301 | 7 |
| Mathematical Reasoning | BeyondAIME | Mean@10: 71.8 | 4 |