Knowledge-intensive Reasoning on 2Wiki (F1, EM, TC)
[Figure: F1 Score chart; top entry EAPO with 52 (Jun 1, 2026). Metric views: F1 Score, Exact Match (EM), TC Score.]
Evaluation Results
| Method | Configuration | Date | F1 Score | Exact Match (EM) | TC Score |
|---|---|---|---|---|---|
| EAPO | Backbone=Llama3.1-8B-I... | 2026.06 | 52 | 44.8 | 1.69 |
| GRPO | Backbone=Llama3.1-8B-I... | 2026.06 | 48 | 40.7 | 3.27 |
| Reinforce++ | Backbone=Llama3.1-8B-I... | 2026.06 | 47.1 | 38.4 | 2.13 |
| TIR | Backbone=Llama3.1-8B-I... | 2026.06 | 33.7 | 28.6 | 2.98 |
| Base | Backbone=Llama3.1-8B-I... | 2026.06 | 23.7 | 10.2 | - |