Multi-hop reasoning on 2WikiMultihopQA
[Chart: leaderboard results by Exact Match (EM) and F1 Score. Best result shown: Prompt-R1, 48.44 EM (Nov 2, 2025).]
Evaluation Results
Method         Configuration              Date     Exact Match (EM)  F1 Score
Prompt-R1      Backbone=GPT-4o-mini       2025.11  48.44             54.41
CoT Reasoning  Backbone=GPT-4o-mini       2025.11  43.75             49.13
SFT            Backbone=Qwen3-4B          2025.11  41.41             42.62
GEPA           Optimization Framework...  2025.11  41.41             46.27
GRPO           Backbone=Qwen3-4B          2025.11  34.38             35.05
Baseline       Backbone=GPT-4o-mini       2025.11  33.59             36.57
Baseline       Backbone=Qwen3-4B          2025.11  28.13             29.32
OPRO           Optimization Framework...  2025.11  25                35.96
CoT Reasoning  Backbone=Qwen3-4B          2025.11  21.88             24.17
TextGrad       Optimization Framework...  2025.11  18.75             27.5
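
The page reports Exact Match (EM) and F1 Score without saying how they are computed. The sketch below shows one common way these two metrics are calculated for short-answer question answering, assuming SQuAD-style answer normalization (lowercasing, stripping punctuation and English articles) and token-level overlap for F1. The function names `normalize_answer`, `exact_match`, and `f1_score` are illustrative, and the scoring script actually used for the results on this page may differ.

```python
import re
import string
from collections import Counter


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace.

    Assumption: SQuAD-style normalization; the benchmark's own script may differ.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(reference))


def f1_score(prediction: str, reference: str) -> float:
    """Token-level F1 between the normalized prediction and reference."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Under these assumptions, a prediction of "Paris, France" against the reference "Paris" scores 0 on EM but 2/3 on F1, which is why F1 is never lower than EM for a given method. Averaging the per-question scores over the evaluation set and multiplying by 100 would give percentages on the same scale as the leaderboard.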