Tool-use Task Completion on Tau-Bench Airline v2 (test)
[Chart: Pass Rate over time, with a Delta Improvement series. Top entry: Claude Sonnet 4.5, Pass Rate 70, Apr 3, 2026.]
Updated 13d ago
Evaluation Results
| Method | Size | Date | Pass Rate | Delta Improvement |
|---|---|---|---|---|
| Claude Sonnet 4.5 | - | 2026.04 | 70 | - |
| Qwen3-30B-A3B + IRC | 30.5B | 2026.04 | 69.5 | 11.5 |
| Qwen3-30B-A3B + MT-GRPO | 30.5B | 2026.04 | 68 | 10 |
| Qwen3.5-4B + IRC | 4B | 2026.04 | 66.7 | 2.9 |
| Qwen3.5-4B + MT-GRPO | 4B | 2026.04 | 64.6 | 0.8 |
| Qwen3.5-4B (base) | 4B | 2026.04 | 63.8 | - |
| Qwen3-30B-A3B (base) | 30.5B | 2026.04 | 58 | - |
| GPT-4.1 | - | 2026.04 | 49.4 | - |
| GPT-4o | - | 2026.04 | 42.8 | - |
| Claude 3.5 Haiku | - | 2026.04 | 22.8 | - |
| GPT-4.1 nano | - | 2026.04 | 14 | - |
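
The page does not define Delta Improvement. The listed values are consistent with one reading: the pass rate of a trained variant minus the pass rate of its base model. The sketch below checks that reading against the Qwen rows. The dictionary layout and the `delta_improvement` helper are illustrative assumptions, not part of the benchmark's tooling.

```python
# Pass rates copied from the Evaluation Results entries for the Qwen models.
pass_rate = {
    "Qwen3-30B-A3B (base)": 58.0,
    "Qwen3-30B-A3B + IRC": 69.5,
    "Qwen3-30B-A3B + MT-GRPO": 68.0,
    "Qwen3.5-4B (base)": 63.8,
    "Qwen3.5-4B + IRC": 66.7,
    "Qwen3.5-4B + MT-GRPO": 64.6,
}


def delta_improvement(method: str) -> float:
    """Assumed definition: variant pass rate minus its base model's pass rate."""
    # "Qwen3.5-4B + IRC" -> "Qwen3.5-4B (base)"
    base_name = method.split(" + ")[0] + " (base)"
    # Round to one decimal to match the precision shown on the page.
    return round(pass_rate[method] - pass_rate[base_name], 1)


# Compute the delta for every trained variant (names containing " + ").
deltas = {m: delta_improvement(m) for m in pass_rate if " + " in m}

for method, delta in deltas.items():
    print(f"{method}: +{delta}")
```

Under this assumption the computed deltas match the four values reported on the page (11.5, 10, 2.9, and 0.8), which suggests the column measures gain over the corresponding base model rather than over any other entry.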