Compositional Multi-hop QA on Bamboogle (Success Rate)
Top result: Llama 3.1 405B, 77.6 Success Rate (Dec 22, 2025). Updated 4d ago.
Evaluation Results
Method          Scale group     Date      Success Rate
Llama 3.1 405B  Large Scal...   2025.12   77.6
Llama 3.1 70B   Large Scal...   2025.12   76.8
Qwen 3 14B      Large Scal...   2025.12   76
GenEnv          7B Models,...   2025.12   76
Qwen 3 32B      Large Scal...   2025.12   71.2
Qwen 2.5 72B    Large Scal...   2025.12   69.6
ReSearch        7B Models,...   2025.12   68
Qwen 2.5 7B     7B Models,...   2025.12   68
SearchR1        7B Models,...   2025.12   67.2
ToRL            7B Models,...   2025.12   34.4
GPT-OSS 20B     Large Scal...   2025.12   33.6
GPT-OSS 120B    Large Scal...   2025.12   29.6