Multi-hop tool use on ToolHop
Top result: MatchTIR (KM), 46.16 Answer Correctness (Jan 15, 2026)
Evaluation Results
| Method | Backbone | Date | Answer Correctness |
| --- | --- | --- | --- |
| MatchTIR (KM) | Qwen3-8B | 2026.01 | 46.16 |
| MatchTIR (OT) | Qwen3-8B | 2026.01 | 45.8 |
| FTRL-M | Qwen3-8B | 2026.01 | 43.32 |
| MatchTIR (KM) | Qwen3-4B | 2026.01 | 42.55 |
| ToolRL-M | Qwen3-8B | 2026.01 | 42.55 |
| Vanilla | Qwen3-8B | 2026.01 | 42.21 |
| MatchTIR (OT) | Qwen3-4B | 2026.01 | 41.95 |
| FTRL-M | Qwen3-4B | 2026.01 | 41.24 |
| GRPO | Qwen3-8B | 2026.01 | 40.64 |
| FTRL-S | Qwen3-4B | 2026.01 | 38.63 |
| GRPO | Qwen3-4B | 2026.01 | 37.25 |
| FTRL-S | Qwen3-8B | 2026.01 | 36.72 |
| ToolRL-M | Qwen3-4B | 2026.01 | 35.68 |
| ToolRL-S | Qwen3-8B | 2026.01 | 32.93 |
| Vanilla | Qwen3-4B | 2026.01 | 31.63 |
| ToolRL-S | Qwen3-4B | 2026.01 | 30.28 |