
SimpleQA

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Question Answering | SimpleQA | Accuracy 95.3 | 114 |
| Question Answering | SimpleQA Verified | Accuracy 82.5 | 60 |
| Agentic Evaluation | SimpleQA | Accuracy 84 | 50 |
| Multi-hop Question Answering | SimpleQA | Pass@1 18.8 | 36 |
| Knowledge Graph Question Answering | SimpleQA Freebase-based (test) | Hits@1 84.8 | 31 |
| Ranking Stability Analysis | SimpleQA and 4 Hallucination Benchmarks | Kendall's W 0.9 | 28 |
| Confidence Calibration | SimpleQA | Brier Score 0.0386 | 27 |
| Hallucination self-detection | SimpleQA | AUROC 95.9 | 27 |
| Factual QA | SimpleQA | Accuracy 5.92 | 24 |
| Question Answering with Abstention | SimpleQA | BAS 0.2 | 24 |
| Uncertainty Estimation | SimpleQA | AUROC 61.3 | 24 |
| Closed-book Question Answering | SimpleQA (train) | Accuracy 18.8 | 21 |
| Question Answering | SimpleQA | Accuracy 32.9 | 20 |
| Helpfulness | SimpleQA | Accuracy 6.64 | 20 |
| Short-form Question Answering | SimpleQA | ECE 7.9 | 18 |
| Question Answering | SimpleQA-verified OOD | Accuracy 42.2 | 18 |
| Tool Use | SimpleQA | Accuracy 91.5 | 12 |
| Question Answering | SimpleQA | Score 56.3 | 11 |
| Question Answering | SimpleQA | pass@128 53.97 | 11 |
| Question Answering | SimpleQA out-domain (test) | LasJ 36.5 | 11 |
| Generative Question Answering | SimpleQA | HALL Score 92 | 10 |
| Watermark Detection | SimpleQA | Delta_q 0.81 | 10 |
| Factual Question Answering | SimpleQA (test) | Accuracy 79.07 | 10 |
| web-agent QA | SimpleQA | F1 (Avg) 67.8 | 8 |
| Question Answering | SimpleQA | F1 Score 56.1 | 7 |
Showing 25 of 44 rows