Step-level correctness prediction on GTA
[Chart: highlighted point PAIR, 92.17 AUROC, May 18, 2026. Selectable metrics: AUROC, ECE.]
Updated 15d ago
Evaluation Results
| Method | Tag | Date | AUROC | ECE |
|---|---|---|---|---|
| PAIR | Probing Category=Prefi... | 2026.05 | 92.17 | 11.98 |
| Multi-layer | Probing Category=Hidde... | 2026.05 | 90.82 | 13.42 |
| Multi-Attn | Probing Category=Atten... | 2026.05 | 90.34 | 13.56 |
| Hidden + Attn | Probing Category=Atten... | 2026.05 | 89.78 | 13.78 |
| Last-token | Probing Category=Hidde... | 2026.05 | 89.67 | 12.94 |
| Mean-Pooled | Probing Category=Hidde... | 2026.05 | 88.51 | 15.21 |
| Head Entropy | Probing Category=Unsup... | 2026.05 | 88.21 | 14.32 |
| Lookback Lens | Probing Category=Unsup... | 2026.05 | 87.45 | 11.82 |
| Attention | Probing Category=Atten... | 2026.05 | 87.09 | 9.54 |
| CoE-C | Probing Category=Unsup... | 2026.05 | 74.91 | - |
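
For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how AUROC and ECE are commonly computed for a binary step-correctness predictor. It is not taken from this benchmark: the function names `auroc` and `ece`, the equal-width binning, and the default of 10 bins are all assumptions made for illustration, and the leaderboard's own evaluation protocol may differ. Both functions return fractions in [0, 1]; the leaderboard values appear to be on a 0 to 100 scale.

```python
from typing import List, Sequence, Tuple


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Fraction of (correct, incorrect) step pairs ranked in the right order.

    Ties between a correct and an incorrect step count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect step")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def ece(probs: Sequence[float], labels: Sequence[int], n_bins: int = 10) -> float:
    """Expected calibration error with equal-width bins (an assumed variant).

    probs are predicted probabilities that a step is correct; for each bin,
    the mean prediction is compared with the observed fraction of correct steps.
    """
    bins: List[List[Tuple[float, int]]] = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))
    total = len(probs)
    err = 0.0
    for b in bins:
        if not b:
            continue
        mean_conf = sum(p for p, _ in b) / len(b)
        frac_correct = sum(y for _, y in b) / len(b)
        err += (len(b) / total) * abs(mean_conf - frac_correct)
    return err
```

With hypothetical predictions `[0.9, 0.4, 0.3, 0.6]` and labels `[1, 1, 0, 0]`, three of the four correct/incorrect pairs are ordered correctly, so `auroc` gives 0.75.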