Knowledge-Intensive Reasoning on MuSiQue (F1 score)
Top result: Llama3.1-8B + ARPO, 34.8 F1 Score (Dec 11, 2025)
Evaluation Results
| Method | Details | Date | F1 Score |
|---|---|---|---|
| Llama3.1-8B + ARPO | Base Model=Llama3.1-8B... | 2025.12 | 34.8 |
| Qwen2.5-7B + ARPO | Base Model=Qwen2.5-7B,... | 2025.12 | 31.1 |
| Llama3.1-8B + GRPO | Base Model=Llama3.1-8B... | 2025.12 | 31 |
| Qwen2.5-7B + GRPO | Base Model=Qwen2.5-7B,... | 2025.12 | 30.6 |
| Qwen2.5-3B + DAPO | Base Model=Qwen2.5-3B,... | 2025.12 | 30 |
| Llama3.1-8B + Reinforce ++ | Base Model=Llama3.1-8B... | 2025.12 | 29.9 |
| Llama3.1-8B + DAPO | Base Model=Llama3.1-8B... | 2025.12 | 29.2 |
| Qwen2.5-3B + ARPO | Base Model=Qwen2.5-3B,... | 2025.12 | 28.7 |
| Qwen2.5-7B + DAPO | Base Model=Qwen2.5-7B,... | 2025.12 | 28.6 |
| Qwen2.5-3B + Reinforce ++ | Base Model=Qwen2.5-3B,... | 2025.12 | 27.9 |
| Qwen2.5-7B + Reinforce ++ | Base Model=Qwen2.5-7B,... | 2025.12 | 25.2 |
| Qwen2.5-3B + GRPO | Base Model=Qwen2.5-3B,... | 2025.12 | 24.7 |
| Llama3.1-8B + TIR Prompting | Base Model=Llama3.1-8B... | 2025.12 | 15.5 |
| Llama3.1-8B | Base Model=Llama3.1-8B... | 2025.12 | 10.4 |
| Qwen2.5-7B + TIR Prompting | Base Model=Qwen2.5-7B,... | 2025.12 | 9.5 |
| Qwen2.5-7B | Base Model=Qwen2.5-7B,... | 2025.12 | 6.6 |
| Qwen2.5-3B + TIR Prompting | Base Model=Qwen2.5-3B,... | 2025.12 | 6.1 |
| Qwen2.5-3B | Base Model=Qwen2.5-3B,... | 2025.12 | 3.6 |
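
The page does not say how the F1 Score is computed. As a rough illustration only, the sketch below assumes the SQuAD-style token-level F1 commonly used for question-answering benchmarks: answers are normalized, then precision and recall are taken over overlapping tokens. The function names `normalize_answer` and `token_f1` are hypothetical, and the scoring used for the results above may differ in its normalization or aggregation.

```python
import re
import string
from collections import Counter


def normalize_answer(text: str) -> list[str]:
    """Lowercase, drop punctuation and English articles, split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between one predicted answer and one reference answer."""
    pred_tokens = normalize_answer(prediction)
    ref_tokens = normalize_answer(reference)
    # If either side is empty, score 1.0 only when both are empty.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears on both sides.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Under this assumption, a score on the 0 to 100 scale shown in the table would be the mean of `token_f1` over all questions, multiplied by 100.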