Tool Selection on TaskBench-MM
[Chart: F1 Score over time. Top result: OLIVIA, 20.7 F1 Score, May 11, 2026]
Updated 21d ago
Evaluation Results
| Method | Backbone        | Date    | F1 Score |
|--------|-----------------|---------|----------|
| OLIVIA | Mistral-7B-v0.1 | 2026.05 | 20.7     |
| OLIVIA | Qwen3-4B        | 2026.05 | 17.8     |
| CLIN   | Qwen3-4B        | 2026.05 | 17       |
| ReAct  | Qwen3-4B        | 2026.05 | 16.2     |
| ReAct  | Mistral-7B-v0.1 | 2026.05 | 14.6     |
| BM25   | Qwen3-4B        | 2026.05 | 12.7     |
| BM25   | Mistral-7B-v0.1 | 2026.05 | 12.7     |
| CLIN   | Mistral-7B-v0.1 | 2026.05 | 12.7     |
| Bandit | Qwen3-4B        | 2026.05 | 12.6     |
| Bandit | Mistral-7B-v0.1 | 2026.05 | 12.6     |
| CoT    | Mistral-7B-v0.1 | 2026.05 | 6.4      |
| CoT    | Qwen3-4B        | 2026.05 | 3.1      |