Knowledge-Intensive Reasoning on 2wikiMultiHopQA (F1 Score)
[Chart: F1 Score by date. Top result shown: Qwen2.5-7B + GRPO, 76.1 F1 (Dec 11, 2025).]
Evaluation Results
Method                        Method details              Date      F1 Score
Qwen2.5-7B + GRPO             Base Model=Qwen2.5-7B,...   2025.12   76.1
Qwen2.5-7B + ARPO             Base Model=Qwen2.5-7B,...   2025.12   76.1
Llama3.1-8B + ARPO            Base Model=Llama3.1-8B...   2025.12   75.5
Llama3.1-8B + GRPO            Base Model=Llama3.1-8B...   2025.12   71.8
Llama3.1-8B + Reinforce++     Base Model=Llama3.1-8B...   2025.12   71.6
Llama3.1-8B + DAPO            Base Model=Llama3.1-8B...   2025.12   70.3
Qwen2.5-7B + Reinforce++      Base Model=Qwen2.5-7B,...   2025.12   68.9
Qwen2.5-7B + DAPO             Base Model=Qwen2.5-7B,...   2025.12   68.4
Qwen2.5-3B + ARPO             Base Model=Qwen2.5-3B,...   2025.12   67.4
Qwen2.5-3B + GRPO             Base Model=Qwen2.5-3B,...   2025.12   64.5
Qwen2.5-3B + DAPO             Base Model=Qwen2.5-3B,...   2025.12   62.5
Qwen2.5-3B + Reinforce++      Base Model=Qwen2.5-3B,...   2025.12   62.3
Llama3.1-8B + TIR Prompting   Base Model=Llama3.1-8B...   2025.12   47.5
Llama3.1-8B                   Base Model=Llama3.1-8B...   2025.12   24.6
Qwen2.5-7B + TIR Prompting    Base Model=Qwen2.5-7B,...   2025.12   18.3
Qwen2.5-3B + TIR Prompting    Base Model=Qwen2.5-3B,...   2025.12   14.1
Qwen2.5-7B                    Base Model=Qwen2.5-7B,...   2025.12   12.6
Qwen2.5-3B                    Base Model=Qwen2.5-3B,...   2025.12   9.4
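
The page does not say how the F1 Score is computed. As a rough sketch only, assuming the benchmark follows the SQuAD-style token-overlap F1 commonly used for multi-hop QA, the metric could look like the following. The function names (normalize_answer, token_f1, corpus_f1) and the normalization rules are illustrative assumptions, not the evaluation code behind the numbers above.

```python
import re
import string
from collections import Counter


def normalize_answer(text: str) -> list[str]:
    """Lowercase, drop punctuation and English articles, split on whitespace.

    Assumption: SQuAD-style normalization; the benchmark's own rules may differ.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall for one answer pair."""
    pred_tokens = normalize_answer(prediction)
    ref_tokens = normalize_answer(reference)
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def corpus_f1(predictions: list[str], references: list[str]) -> float:
    """Mean per-example F1, scaled to 0-100 like the scores in the table."""
    scores = [token_f1(p, r) for p, r in zip(predictions, references)]
    return 100.0 * sum(scores) / len(scores)
```

Under these assumptions, corpus_f1(["Paris", "London"], ["Paris", "Berlin"]) averages one exact match and one miss, giving 50.0 on the same 0-100 scale as the leaderboard.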