Tool-use Agent Evaluation on τ-bench Retail (Last 105 Tasks)
Best Pass@1: 42.5 (POLCA), Mar 16, 2026
Evaluation Results
Method        Date      Pass@1
POLCA         2026.03   42.5
OpenEvolve    2026.03   42.2
GEPA          2026.03   41.7
Base Prompt   2026.03   39.2