General Question Answering on PopQA out-of-domain (val test)
Top result: Search-R2, 50.1 Exact Match (EM), Feb 3, 2026
Updated 4d ago
Evaluation Results
| Method | Backbone | Date | Exact Match (EM) |
|---|---|---|---|
| Search-R2 | Qwen2.5-32B | 2026.02 | 50.1 |
| Search-R1 | Qwen2.5-32B | 2026.02 | 47 |
| Search-R2 | Qwen3-8B | 2026.02 | 46.6 |
| Search-R1 | Qwen3-8B | 2026.02 | 41.8 |
| Search-R2 | Qwen2.5-7B | 2026.02 | 41 |
| RAG | Qwen2.5-7B | 2026.02 | 39.2 |
| Search-R1 | Qwen2.5-7B | 2026.02 | 38.8 |
| Rejection Sampling | Qwen2.5-7B | 2026.02 | 38 |
| IRCoT | Qwen2.5-7B | 2026.02 | 30.1 |
| R1-base | Qwen2.5-7B | 2026.02 | 20.2 |
| R1-instruct | Qwen2.5-7B | 2026.02 | 19.9 |
| Direct Inference | Qwen2.5-7B | 2026.02 | 14 |
| Search-o1 | Qwen2.5-7B | 2026.02 | 13.1 |
| SFT | Qwen2.5-7B | 2026.02 | 12.1 |
| CoT | Qwen2.5-7B | 2026.02 | 5.4 |
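
The page does not say how Exact Match (EM) is computed for these results. As a rough illustration only, the sketch below assumes the common SQuAD-style convention: lowercase the text, strip punctuation and English articles, collapse whitespace, then check whether the prediction equals any gold answer. The function names and sample strings are hypothetical and are not taken from the benchmark's evaluation code.

```python
import re
import string

_PUNCT = set(string.punctuation)


def normalize_answer(text: str) -> str:
    """Assumed SQuAD-style normalization; the benchmark's own rules may differ."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCT)  # drop punctuation
    text = re.sub(r"\b(a|an|the)\b", " ", text)            # drop English articles
    return " ".join(text.split())                          # collapse whitespace


def exact_match(prediction: str, gold_answers: list[str]) -> bool:
    """True if the normalized prediction equals any normalized gold answer."""
    pred = normalize_answer(prediction)
    return any(pred == normalize_answer(gold) for gold in gold_answers)


def em_score(predictions: list[str], references: list[list[str]]) -> float:
    """Percentage of predictions that exactly match a gold answer."""
    if not predictions or len(predictions) != len(references):
        raise ValueError("predictions and references must be non-empty and aligned")
    hits = sum(exact_match(p, refs) for p, refs in zip(predictions, references))
    return 100.0 * hits / len(predictions)


if __name__ == "__main__":
    # Hypothetical toy data, not drawn from PopQA.
    preds = ["The Eiffel Tower.", "london"]
    golds = [["Eiffel Tower"], ["Berlin", "Bonn"]]
    print(em_score(preds, golds))
```

Under this convention a score such as 50.1 would mean that roughly half of the evaluated questions received an answer matching a gold string after normalization.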