Step-level Correctness Prediction on ToolBench
[Chart: AUROC leaderboard trend. Top entry: PAIR, AUROC 0.8129 (May 18, 2026). Available metrics: AUROC, ECE.]
Updated 15d ago
Evaluation Results
| Method | Probing Category | Date | AUROC | ECE |
|---|---|---|---|---|
| PAIR | Prefi... | 2026.05 | 0.8129 | 22.36 |
| Multi-layer | Hidde... | 2026.05 | 0.7821 | 24.76 |
| Multi-Attn | Atten... | 2026.05 | 0.7765 | 19.61 |
| Last-token | Hidde... | 2026.05 | 0.7654 | 25.62 |
| Hidden + Attn | Atten... | 2026.05 | 0.7598 | 26.83 |
| Head Entropy | Unsup... | 2026.05 | 0.7561 | 23.29 |
| Mean-Pooled | Hidde... | 2026.05 | 0.7372 | 28.71 |
| Lookback Lens | Unsup... | 2026.05 | 0.7295 | 16.85 |
| Attention | Atten... | 2026.05 | 0.6847 | 19.98 |
| CoE-C | Unsup... | 2026.05 | 0.5953 | - |
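
For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how AUROC and ECE are commonly computed for step-level correctness predictions. It is not the benchmark's evaluation code. The function names `auroc` and `ece`, the use of 10 equal-width bins, and the input layout (one confidence score in [0, 1] and one 0/1 correctness label per step) are all assumptions made for illustration. The ECE values in the table look like they are on a 0-100 scale, which is also an assumption; this sketch returns a fraction in [0, 1].

```python
from typing import Sequence


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a randomly chosen correct step (label 1) receives a
    higher score than a randomly chosen incorrect step (label 0).
    Ties count as half a win. Pairwise O(n^2) version, for clarity."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def ece(scores: Sequence[float], labels: Sequence[int], n_bins: int = 10) -> float:
    """Expected calibration error with equal-width bins (assumed scheme).
    Returns a fraction in [0, 1]; multiply by 100 for a percentage."""
    n = len(scores)
    bins = [[] for _ in range(n_bins)]
    for s, y in zip(scores, labels):
        # A score of exactly 1.0 is clamped into the last bin.
        idx = min(int(s * n_bins), n_bins - 1)
        bins[idx].append((s, y))
    total = 0.0
    for b in bins:
        if not b:
            continue
        mean_conf = sum(s for s, _ in b) / len(b)
        accuracy = sum(y for _, y in b) / len(b)
        # Weight each bin's |accuracy - confidence| gap by its share of samples.
        total += (len(b) / n) * abs(accuracy - mean_conf)
    return total


if __name__ == "__main__":
    # Toy data, not taken from the benchmark.
    step_scores = [0.9, 0.4, 0.3, 0.6, 0.7]
    step_labels = [1, 1, 0, 1, 0]
    print(f"AUROC: {auroc(step_scores, step_labels):.4f}")
    print(f"ECE:   {ece(step_scores, step_labels):.4f}")
```

Higher AUROC means better ranking of correct over incorrect steps, while lower ECE means better-calibrated confidence. The two can disagree, as with Lookback Lens, which has the lowest reported ECE in the table but not the highest AUROC.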