Knowledge-Intensive Reasoning on WebWalker
Top result: Llama3.1-8B + ARPO, F1 Score 30.5
[Chart: F1 Score by date; date label shown: Dec 11, 2025]
Evaluation Results
| Method | Details | Date | F1 Score |
|---|---|---|---|
| Llama3.1-8B + ARPO | Base Model=Llama3.1-8B... | 2025.12 | 30.5 |
| Llama3.1-8B + Reinforce ++ | Base Model=Llama3.1-8B... | 2025.12 | 27.5 |
| Llama3.1-8B + GRPO | Base Model=Llama3.1-8B... | 2025.12 | 26.5 |
| Qwen2.5-7B + Reinforce ++ | Base Model=Qwen2.5-7B,... | 2025.12 | 26 |
| Qwen2.5-7B + ARPO | Base Model=Qwen2.5-7B,... | 2025.12 | 26 |
| Llama3.1-8B + DAPO | Base Model=Llama3.1-8B... | 2025.12 | 25.5 |
| Qwen2.5-3B + ARPO | Base Model=Qwen2.5-3B,... | 2025.12 | 24.5 |
| Qwen2.5-7B + DAPO | Base Model=Qwen2.5-7B,... | 2025.12 | 24 |
| Qwen2.5-7B + GRPO | Base Model=Qwen2.5-7B,... | 2025.12 | 22 |
| Qwen2.5-3B + GRPO | Base Model=Qwen2.5-3B,... | 2025.12 | 21 |
| Qwen2.5-3B + Reinforce ++ | Base Model=Qwen2.5-3B,... | 2025.12 | 19.5 |
| Qwen2.5-3B + DAPO | Base Model=Qwen2.5-3B,... | 2025.12 | 19.5 |
| Qwen2.5-7B + TIR Prompting | Base Model=Qwen2.5-7B,... | 2025.12 | 15.5 |
| Llama3.1-8B + TIR Prompting | Base Model=Llama3.1-8B... | 2025.12 | 15 |
| Qwen2.5-3B + TIR Prompting | Base Model=Qwen2.5-3B,... | 2025.12 | 14 |
| Llama3.1-8B | Base Model=Llama3.1-8B... | 2025.12 | 3 |
| Qwen2.5-7B | Base Model=Qwen2.5-7B,... | 2025.12 | 2 |
| Qwen2.5-3B | Base Model=Qwen2.5-3B,... | 2025.12 | 0.5 |