Reasoning Quality Evaluation on a 120-Tool Benchmark (500 Simulated Tasks)
[Chart: Mean Score by method. Highlighted value: Tool Attention, 4.43.]
Apr 23, 2026
Updated 1mo ago
Evaluation Results
| Method | Type | Date | Mean Score | Standard Deviation (SD) | Proportion Scoring ≥ 4 (%) |
|---|---|---|---|---|---|
| Tool Attention | Projected | 2026.04 | 4.43 | 0.62 | 87.6 |
| B4 CLI Lazy | Projected | 2026.04 | 4.02 | 0.77 | 74.1 |
| B3 Simple Retrieval | Projected | 2026.04 | 3.89 | 0.81 | 68.7 |
| B2 Static Pruning | Projected | 2026.04 | 3.35 | 0.98 | 48 |
| B1 Full-Schema | Projected | 2026.04 | 3.21 | 1.04 | 43.2 |
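The three reported metrics (Mean Score, Standard Deviation, and Proportion Scoring ≥ 4) are summary statistics over per-task scores. The sketch below shows one way such a summary could be computed. It is an illustration under stated assumptions, not the benchmark's actual scoring code: the function name `summarize_scores`, the use of population standard deviation, the expression of the proportion as a percentage, and the sample scores are all hypothetical.

```python
from statistics import mean, pstdev


def summarize_scores(scores, threshold=4.0):
    """Return (mean, SD, percent of scores >= threshold) for a list of scores.

    Assumptions (not confirmed by the source): population SD is used, and the
    proportion is reported as a percentage.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    # Booleans sum as 0/1, so this counts the scores at or above the threshold.
    percent_at_or_above = 100.0 * sum(s >= threshold for s in scores) / len(scores)
    return mean(scores), pstdev(scores), percent_at_or_above


# Hypothetical per-task scores, for illustration only (not benchmark data).
example_scores = [5, 4, 4, 3, 5, 4, 2, 5]
avg, sd, pct = summarize_scores(example_scores)
print(f"mean={avg:.2f} sd={sd:.2f} pct_ge_4={pct:.1f}")
# prints: mean=4.00 sd=1.00 pct_ge_4=75.0
```

If the benchmark uses the sample standard deviation instead, `statistics.stdev` would replace `pstdev`. With 500 tasks the two differ only slightly.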