Standard Question Answering on SQuAD v2
[Chart: EM and F1 on SQuAD v2. Highlighted point: Prompt-R1, 19.53 EM, Nov 2, 2025.]
Updated 1mo ago
Evaluation Results
| Method | Details | Date | EM | F1 |
|---|---|---|---|---|
| Prompt-R1 | | 2025.11 | 19.53 | 29.28 |
| CoT Reasoning | Backbone=GPT-4o-mini | 2025.11 | 14.06 | 25.73 |
| Baseline | Backbone=GPT-4o-mini | 2025.11 | 13.28 | 25.61 |
| GEPA | Category=APO (GPT-4o-m... | 2025.11 | 13.28 | 25.52 |
| OPRO | Category=APO (GPT-4o-m... | 2025.11 | 10.94 | 26.67 |
| GRPO | Backbone=Qwen3-4B | 2025.11 | 10.16 | 23.1 |
| Baseline | Backbone=Qwen3-4B | 2025.11 | 6.25 | 16.09 |
| CoT Reasoning | Backbone=Qwen3-4B | 2025.11 | 6.25 | 16.25 |
| TextGrad | Category=APO (GPT-4o-m... | 2025.11 | 6.25 | 22.04 |
| SFT | Backbone=Qwen3-4B | 2025.11 | 5.47 | 16.18 |
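For readers unfamiliar with the EM and F1 columns, the following is a minimal sketch of how these two metrics are conventionally computed for SQuAD-style question answering: answers are normalized (lowercased, punctuation and English articles removed, whitespace collapsed), EM checks string equality, and F1 measures token overlap. This page does not state how its scores were produced, so treating them as this convention is an assumption; the function names `normalize_answer`, `exact_match`, `token_f1`, and `score` are illustrative and not taken from any listed method.

```python
import re
import string
from collections import Counter

_PUNCTUATION = set(string.punctuation)


def normalize_answer(text: str) -> str:
    """Lowercase, strip punctuation, drop English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall after normalization."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    if not pred_tokens or not ref_tokens:
        # An empty (no-answer) string only matches another empty string.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def score(predictions: list[str], references: list[list[str]]) -> dict[str, float]:
    """Average EM and F1 (as percentages), taking the best match per question.

    Assumes every question has at least one reference string; use [""] for
    an unanswerable question.
    """
    em_total = 0.0
    f1_total = 0.0
    for pred, refs in zip(predictions, references):
        em_total += max(exact_match(pred, ref) for ref in refs)
        f1_total += max(token_f1(pred, ref) for ref in refs)
    n = len(predictions)
    return {"EM": 100.0 * em_total / n, "F1": 100.0 * f1_total / n}
```

Under this convention F1 is never lower than EM for the same prediction, which is consistent with every row reporting a higher F1 than EM.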