Knowledge-Intensive Reasoning on Bamboogle (F1 score)
[Chart: F1 over time. Best result: 73.8 F1, Llama3.1-8B + ARPO, Dec 11, 2025.]
Evaluation Results
Method                      | Details                   | Date    | F1
----------------------------|---------------------------|---------|-----
Llama3.1-8B + ARPO          | Base Model=Llama3.1-8B... | 2025.12 | 73.8
Qwen2.5-7B + ARPO           | Base Model=Qwen2.5-7B,... | 2025.12 | 71.5
Llama3.1-8B + Reinforce ++  | Base Model=Llama3.1-8B... | 2025.12 | 69.1
Qwen2.5-7B + GRPO           | Base Model=Qwen2.5-7B,... | 2025.12 | 68.4
Llama3.1-8B + GRPO          | Base Model=Llama3.1-8B... | 2025.12 | 68.2
Llama3.1-8B + DAPO          | Base Model=Llama3.1-8B... | 2025.12 | 67.3
Qwen2.5-3B + ARPO           | Base Model=Qwen2.5-3B,... | 2025.12 | 66.8
Qwen2.5-3B + Reinforce ++   | Base Model=Qwen2.5-3B,... | 2025.12 | 65.7
Qwen2.5-7B + DAPO           | Base Model=Qwen2.5-7B,... | 2025.12 | 65.5
Qwen2.5-3B + GRPO           | Base Model=Qwen2.5-3B,... | 2025.12 | 65.2
Qwen2.5-7B + Reinforce ++   | Base Model=Qwen2.5-7B,... | 2025.12 | 64.9
Qwen2.5-3B + DAPO           | Base Model=Qwen2.5-3B,... | 2025.12 | 64.8
Llama3.1-8B + TIR Prompting | Base Model=Llama3.1-8B... | 2025.12 | 58.4
Llama3.1-8B                 | Base Model=Llama3.1-8B... | 2025.12 | 40
Qwen2.5-7B                  | Base Model=Qwen2.5-7B,... | 2025.12 | 24
Qwen2.5-7B + TIR Prompting  | Base Model=Qwen2.5-7B,... | 2025.12 | 23.6
Qwen2.5-3B + TIR Prompting  | Base Model=Qwen2.5-3B,... | 2025.12 | 16.4
Qwen2.5-3B                  | Base Model=Qwen2.5-3B,... | 2025.12 | 11.7
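
The results above mix three base models, so the ranking alone does not show how much each training method adds on top of its own base model. The following is a minimal sketch of that arithmetic, using only the F1 values listed above. The names `RESULTS`, `split_method`, and `gains_over_base` are hypothetical helpers introduced for illustration, and the "Base" label for rows without a method suffix is an assumption of this sketch, not something the leaderboard defines.

```python
from collections import defaultdict

# (method, F1) pairs copied from the results listed above.
RESULTS = [
    ("Llama3.1-8B + ARPO", 73.8),
    ("Qwen2.5-7B + ARPO", 71.5),
    ("Llama3.1-8B + Reinforce ++", 69.1),
    ("Qwen2.5-7B + GRPO", 68.4),
    ("Llama3.1-8B + GRPO", 68.2),
    ("Llama3.1-8B + DAPO", 67.3),
    ("Qwen2.5-3B + ARPO", 66.8),
    ("Qwen2.5-3B + Reinforce ++", 65.7),
    ("Qwen2.5-7B + DAPO", 65.5),
    ("Qwen2.5-3B + GRPO", 65.2),
    ("Qwen2.5-7B + Reinforce ++", 64.9),
    ("Qwen2.5-3B + DAPO", 64.8),
    ("Llama3.1-8B + TIR Prompting", 58.4),
    ("Llama3.1-8B", 40.0),
    ("Qwen2.5-7B", 24.0),
    ("Qwen2.5-7B + TIR Prompting", 23.6),
    ("Qwen2.5-3B + TIR Prompting", 16.4),
    ("Qwen2.5-3B", 11.7),
]


def split_method(name):
    """Split 'Base + Technique' into (base, technique).

    Rows with no ' + ' suffix are treated as the untrained base model
    and labeled 'Base' (an assumption of this sketch).
    """
    base, sep, technique = name.partition(" + ")
    return base, (technique if sep else "Base")


def gains_over_base(results):
    """Return {base_model: {technique: F1 gain over the bare base model}}."""
    by_base = defaultdict(dict)
    for name, f1 in results:
        base, technique = split_method(name)
        by_base[base][technique] = f1

    gains = {}
    for base, scores in by_base.items():
        baseline = scores["Base"]
        gains[base] = {
            technique: round(f1 - baseline, 1)
            for technique, f1 in scores.items()
            if technique != "Base"
        }
    return gains


if __name__ == "__main__":
    for base, per_method in gains_over_base(RESULTS).items():
        # Sort each base model's methods by gain, largest first.
        for technique, gain in sorted(per_method.items(), key=lambda kv: -kv[1]):
            print(f"{base:12s} {technique:14s} {gain:+.1f}")
```

Read this way, ARPO is the top method for every base model, and TIR Prompting scores slightly below the bare Qwen2.5-7B model (23.6 vs. 24).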