
SQuAD

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Question Answering | SQuAD v1.1 (dev) | F1 Score 95.8 | 380 |
| Question Answering | SQuAD v1.1 (test) | F1 Score 95.4 | 260 |
| Question Answering | SQuAD 2.0 | F1 89.4 | 215 |
| Question Answering | SQuAD v2.0 (dev) | F1 91.2 | 163 |
| Question Answering | SQuAD | F1 89.8 | 162 |
| Question Answering | SQuAD (test) | F1 91.2 | 156 |
| Prompt Injection Defense | Inj-SQuAD | Combined ASR 0.11 | 123 |
| Question Answering | SQuAD v1.1 | F1 94.7 | 85 |
| Question Answering | SQuAD | Exact Match 93.33 | 83 |
| Hallucination Detection | SQuAD | AUROC 0.89 | 82 |
| Question Answering | SQuAD (dev) | F1 91 | 74 |
| Question Answering | SQuAD | ACE (General) 0.112 | 70 |
| Question Answering | SQuAD v1.1 (val) | F1 Score 96.22 | 70 |
| Question Answering | SQuAD | F1 Score 94.7 | 63 |
| Machine Reading Comprehension | SQuAD | EM 89.9 | 58 |
| Machine Reading Comprehension | SQuAD 2.0 (dev) | EM 88.8 | 57 |
| Generation | SQuAD | F1 Score 88.3 | 52 |
| Machine Reading Comprehension | SQuAD 2.0 (test) | EM 89.6 | 51 |
| Hallucination Detection | SQuAD (test) | AUROCr 83.8 | 48 |
| Machine Reading Comprehension | SQuAD 1.1 (dev) | EM 89.71 | 48 |
| Machine Reading Comprehension | SQuAD 1.1 (test) | EM 89.898 | 46 |
| Question Answering | SQuAD (test) | GPT Judge Accuracy 89 | 45 |
| Hallucination detection | SQuAD | AUC 85.5 | 40 |
| Reading Comprehension | SQuAD | Attack Accuracy 75.91 | 40 |
| Membership Inference Attack | SQuAD | AUC 0.883 | 39 |
Showing 25 of 235 rows
...