Correctness Prediction on AIME
Top result: Op-XGB, WP-AUC 0.838
[Chart: WP-AUC by date; point at May 28, 2026]
Updated 5d ago
Evaluation Results
Method      Evaluation Protocol   Date      WP-AUC
Op-XGB      ID                    2026.05   0.838
OST         CD                    2026.05   0.801
OST         ID                    2026.05   0.797
Op-XGB      CD                    2026.05   0.779
Wait        CD                    2026.05   0.679
Wait        ID                    2026.05   0.679
Backtrack   CD                    2026.05   0.621
Backtrack   ID                    2026.05   0.621
SelfCheck   CD                    2026.05   0.501
SelfCheck   ID                    2026.05   0.501
Length      CD                    2026.05   0.5
Length      ID                    2026.05   0.5