Measuring short-form factuality in large language models

About

We present SimpleQA, a benchmark that evaluates the ability of language models to answer short, fact-seeking questions. We prioritized two properties in designing this eval. First, SimpleQA is challenging, as it is adversarially collected against GPT-4 responses. Second, responses are easy to grade, because questions are created such that there exists only a single, indisputable answer. Each answer in SimpleQA is graded as either correct, incorrect, or not attempted. A model with ideal behavior would get as many questions correct as possible while not attempting the questions for which it is not confident it knows the correct answer. SimpleQA is a simple, targeted evaluation for whether models "know what they know," and our hope is that this benchmark will remain relevant for the next few generations of frontier models. SimpleQA can be found at https://github.com/openai/simple-evals.
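The three-way grading described above can be summarized with a small tally. The following is a minimal sketch, not the official scoring code from simple-evals: the label strings, the `summarize` function, and the `correct_given_attempted` field are assumptions introduced here for illustration. The last field is one plausible way to reward a model that declines to answer when unsure, since skipped questions are left out of its denominator.

```python
from collections import Counter

# Hypothetical label strings for the three grades named in the abstract.
GRADES = ("correct", "incorrect", "not_attempted")


def summarize(grades):
    """Return the share of each grade plus accuracy over attempted questions."""
    if not grades:
        raise ValueError("grades must be non-empty")
    counts = Counter(grades)
    unknown = set(counts) - set(GRADES)
    if unknown:
        raise ValueError(f"unknown grade labels: {sorted(unknown)}")

    total = len(grades)
    # A question counts as attempted if it was graded correct or incorrect.
    attempted = counts["correct"] + counts["incorrect"]

    summary = {grade: counts[grade] / total for grade in GRADES}
    # Illustrative metric (an assumption, not taken from the text above):
    # accuracy restricted to the questions the model chose to answer.
    summary["correct_given_attempted"] = (
        counts["correct"] / attempted if attempted else 0.0
    )
    return summary


if __name__ == "__main__":
    example = ["correct", "correct", "incorrect", "not_attempted"]
    for name, value in summarize(example).items():
        print(f"{name}: {value:.3f}")
```

Under this sketch, a model that skips a question it would have answered wrongly keeps the same overall share of correct answers but improves its accuracy over attempted questions, which matches the "ideal behavior" described in the abstract.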

Jason Wei, Nguyen Karina, Hyung Won Chung, Yunxin Joy Jiao, Spencer Papay, Amelia Glaese, John Schulman, William Fedus • 2024

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Reasoning Token Induction | Mixed Prompts (SimpleQA, SimpleBench, AIME2024, etc.) (test) | Mean Completion Tokens: 4.46e+3 | 31 |
| Ranking Stability Analysis | SimpleQA and 4 Hallucination Benchmarks | Kendall's W: 0.9 | 28 |
| Ranking Consistency Analysis | MMLU-Pro health Virology | Spearman Correlation: 0.35 | 8 |
| Ranking Consistency Analysis | MMLU-Pro Medical genetics health | Spearman Correlation: 0.00 | 8 |
| Physical chemistry | ChemBench Physical Chemistry | Spearman Correlation: -0.42 | 8 |
| Technical chemistry | ChemBench Technical Chemistry | Spearman Correlation: -0.31 | 8 |
| Ranking Consistency Analysis | MMLU-Pro Nutrition health | Spearman Correlation: 0.11 | 8 |
| Analytical chemistry | ChemBench Analytical Chemistry | Spearman Correlation: -0.67 | 8 |
| Inorganic chemistry | ChemBench Inorganic Chemistry | Spearman Correlation: -0.69 | 8 |
| Material science | ChemBench Material science | Spearman Correlation: -0.78 | 8 |
Showing 10 of 13 rows
