Tool selection on TaskBench
Best F1 Score: 21.1 (CLIN), May 11, 2026. Updated 21d ago.
Evaluation Results
Method   Backbone          Date      F1 Score
CLIN     Qwen3-4B          2026.05   21.1
OLIVIA   Mistral-7B-v0.1   2026.05   20.8
OLIVIA   Qwen3-4B          2026.05   20.5
CLIN     Mistral-7B-v0.1   2026.05   19.3
ReAct    Mistral-7B-v0.1   2026.05   17.8
ReAct    Qwen3-4B          2026.05   17.4
Bandit   Qwen3-4B          2026.05   15.5
Bandit   Mistral-7B-v0.1   2026.05   15.5
BM25     Qwen3-4B          2026.05   11.6
BM25     Mistral-7B-v0.1   2026.05   11.6
CoT      Mistral-7B-v0.1   2026.05   3.6
CoT      Qwen3-4B          2026.05   3.0