Tool Calling on HotpotQA (evaluation)
Top result: 70.4 Accuracy, Baseline (normal SFT)
[Chart: Accuracy / Latency, May 13, 2026]
Updated 20d ago
Evaluation Results
Method                  Model          Date      Accuracy   Latency
Baseline (normal SFT)   Llama-3.2-3B   2026.05   70.4       2.3
AsyncIO (Async-SFT)     Llama-3.2-3B   2026.05   68.7       1.1
Baseline (normal SFT)   Qwen2.5-3B     2026.05   68.6       2.7
AsyncIO (Async-SFT)     Qwen2.5-3B     2026.05   67.5       1.2
AsyncIO (normal SFT)    Qwen2.5-3B     2026.05   33.4       -
AsyncIO (normal SFT)    Llama-3.2-3B   2026.05   23.6       -
AsyncIO (no SFT)        Qwen2.5-3B     2026.05   20.1       -
Baseline (no SFT)       Qwen2.5-3B     2026.05   17.2       -
Baseline (no SFT)       Llama-3.2-3B   2026.05   15.1       -
AsyncIO (no SFT)        Llama-3.2-3B   2026.05   13.9       -