2Wiki

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Question Answering | 2WIKI | F1 72.37 | 75 |
| Multi-hop Question Answering | 2Wiki | F1 Score 70.58 | 41 |
| Multi-hop QA | 2Wiki | EM 62 | 26 |
| Multi-hop Question Answering | 2Wiki (test) | F1 Score 69.7 | 20 |
| Question Answering | 2Wiki 30K context | Accuracy 73.7 | 19 |
| Question Answering | 2Wiki 10K context | Accuracy 72.2 | 19 |
| Question Answering | 2Wiki 100K context | Accuracy 65.5 | 18 |
| Multi-Hop QA Verification | 2wiki | P@1 81.21 | 16 |
| Question Answering | 2WIKI (val) | EM 27.2 | 14 |
| Question Answering | 2WIKI (out-of-domain) | EM 40 | 14 |
| Question Answering | 2Wiki 500 samples (val) | EM 39.6 | 14 |
| Query Rewriting & QA | 2Wiki BM25 | F1 33.6 | 12 |
| Open-domain Question Answering | 2WIKI | Accuracy 48.9 | 11 |
| Multi-Hop Question Answering | 2Wiki (out-of-domain) | Accuracy 42 | 10 |
| Expected Calibration Error | 2Wiki | ECE 17.43 | 10 |
| Multi-Hop QA | 2Wiki (test) | EM 57.5 | 10 |
| Question Answering | 2Wiki Normal | F1 Score 23.63 | 8 |
| Deep Research | 2WIKI (test) | Mean Correct Rate 0.92 | 8 |
| Question Answering | 2Wiki Extreme | F1 Score 26.99 | 7 |
| Question Answering | 2Wiki Noisy | F1 Score 24.17 | 7 |
| Multi-hop Question Answering | 2Wiki | FCR 67.6 | 7 |
| Retrieval | 2Wiki | Recall@5 87 | 7 |
| Multi-Hop Question Answering | 2Wiki Platinum (test) | Answer Rate 84.8 | 6 |
| Short-form Question Answering | 2wiki (test) | EM 26.5 | 5 |
| Question Answering | 2Wiki 1,000 samples (test) | F1 Score 0.388 | 3 |
Showing 25 of 29 rows