SimpleQA

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Question Answering | SimpleQA | Accuracy 95.3 | 94 |
| Question Answering | SimpleQA Verified | Accuracy 82.5 | 60 |
| Agentic Evaluation | SimpleQA | Accuracy 84 | 50 |
| Confidence Calibration | SimpleQA | Brier Score 0.0386 | 27 |
| Hallucination self-detection | SimpleQA | AUROC 95.9 | 27 |
| Uncertainty Estimation | SimpleQA | AUROC 61.3 | 24 |
| Helpfulness | SimpleQA | Accuracy 6.64 | 20 |
| Short-form Question Answering | SimpleQA | ECE 7.9 | 18 |
| Question Answering | SimpleQA-verified OOD | Accuracy 42.2 | 18 |
| Question Answering | SimpleQA | Accuracy 32.9 | 12 |
| Tool Use | SimpleQA | Accuracy 91.5 | 12 |
| Question Answering | SimpleQA | pass@128 53.97 | 11 |
| Question Answering | SimpleQA out-domain (test) | LasJ 36.5 | 11 |
| Watermark Detection | SimpleQA | Delta_q 0.81 | 10 |
| Factual Question Answering | SimpleQA (test) | Accuracy 79.07 | 10 |
| web-agent QA | SimpleQA | F1 (Avg) 67.8 | 8 |
| Question Answering | SimpleQA | EM 58.3 | 7 |
| Fact-seeking Question Answering | SimpleQA no web | Accuracy 55 | 7 |
| Factuality | SimpleQA | Factuality Score 35.3 | 7 |
| Confidence Calibration | SimpleQA (test) | ECE 6.8 | 7 |
| Knowledge Retrieval | SimpleQA | Accuracy 74.01 | 5 |
| Agent Trajectory Performance | SimpleQA (test) | Pass@1 Accuracy 77.1 | 4 |
| Open-domain Factual Question Answering | SimpleQA | Accuracy 3.07 | 3 |
| Question Answering | SimpleQA (test) | SimpleQA Score 4.05 | 3 |
| Short-form QA | SimpleQA | Accuracy 4 | 2 |
Showing 25 of 26 rows