Tool-use agent evaluation on τ-bench airline
[Chart: Pass@1 on τ-bench airline. Top entry: ReAct, 30.4 (Apr 28, 2026). Metric tabs: Pass@1, Pass@2, Pass@3, Pass@4, Pass@5.]
Updated 1mo ago
Evaluation Results
Method  Backbone            Date     Pass@1  Pass@2  Pass@3  Pass@4  Pass@5
ReAct   Qwen3-32B           2026.04  30.4    20      16.2    14.8    14
FAMA    Qwen2.5-72B-I...    2026.04  29.2    21.2    18.8    18      18
SR      Qwen2.5-72B-I...    2026.04  28      20.4    17.4    15.6    14
FAMA    Qwen3-32B           2026.04  26.8    20      18.4    18      18
SR      Qwen3-32B           2026.04  25.2    17      13.2    10.4    8
ReAct   Qwen2.5-72B-I...    2026.04  24.4    18.79   15.6    12.8    10
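Every row in the table above scores lower as k grows. That pattern fits a metric where a task counts only if all k sampled trials succeed (the pass^k reliability metric defined for τ-bench), and it does not fit an at-least-one-of-k metric, which cannot decrease with k. The page itself does not state which definition it uses, so the sketch below is an assumption: it computes the all-k-succeed score from per-task success counts using the combinatorial estimator C(c, k) / C(n, k). The function name `pass_hat_k` and the sample counts are hypothetical and are not taken from the leaderboard.

```python
from math import comb


def pass_hat_k(successes, num_trials, k):
    """Estimate the chance that k trials drawn from the same task all succeed.

    successes:  per-task count of successful trials (hypothetical input format)
    num_trials: number of trials run per task (n)
    k:          number of trials that must all succeed
    Returns the mean over tasks of C(c, k) / C(n, k), as a fraction in [0, 1].
    """
    if not successes:
        raise ValueError("successes must contain at least one task")
    if not 1 <= k <= num_trials:
        raise ValueError("k must satisfy 1 <= k <= num_trials")
    if any(c < 0 or c > num_trials for c in successes):
        raise ValueError("each success count must lie in [0, num_trials]")
    denom = comb(num_trials, k)
    # comb(c, k) is 0 when c < k, so tasks with too few successes add nothing.
    return sum(comb(c, k) for c in successes) / (len(successes) * denom)


# Made-up example: 4 tasks, 5 trials each.
example_successes = [5, 3, 0, 1]
example_scores = [100 * pass_hat_k(example_successes, 5, k) for k in range(1, 6)]
for k, score in enumerate(example_scores, start=1):
    print(f"k={k}: {score:.1f}")
```

Under this definition the score at k=1 is the plain per-trial success rate, and the scores are non-increasing in k, which matches the shape of the rows above.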