Agent Task on ToolBench
[Figure: Success Rate chart. Top result: PAIR, 44.98 Success Rate, May 18, 2026.]
Updated 15d ago
Evaluation Results
Method            Date      Success Rate
PAIR              2026.05   44.98
LLM-as-a-judge    2026.05   42.03
CoE-C             2026.05   40.34
IGPO              2026.05   39.52
Tree-GRPO         2026.05   39.35
Lookback-ratio    2026.05   39.34
Mean-pooled       2026.05   39.17
Outcome           2026.05   38.78
Head-entropy      2026.05   38.68
Multi-layer       2026.05   38.56
Last-token        2026.05   38.37
AT2PO             2026.05   38.12
Hidden+Attn       2026.05   37.84
Multi-Attn        2026.05   36.14
TIPS              2026.05   34.83
Attention         2026.05   34.71