Measuring short-form factuality in large language models
About
We present SimpleQA, a benchmark that evaluates the ability of language models to answer short, fact-seeking questions. We prioritized two properties in designing this eval. First, SimpleQA is challenging, as it is adversarially collected against GPT-4 responses. Second, responses are easy to grade, because questions are created such that there exists only a single, indisputable answer. Each answer in SimpleQA is graded as either correct, incorrect, or not attempted. A model with ideal behavior would get as many questions correct as possible while not attempting the questions for which it is not confident it knows the correct answer. SimpleQA is a simple, targeted evaluation for whether models "know what they know," and our hope is that this benchmark will remain relevant for the next few generations of frontier models. SimpleQA can be found at https://github.com/openai/simple-evals.
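The three-way grading described above can be summarized with a few simple rates. The following is a minimal sketch, not the official SimpleQA scoring code: the function name `summarize_grades`, the label strings, and the choice of summary statistics are assumptions made for illustration. It reports the share of each grade, plus accuracy restricted to the questions the model actually attempted, which rewards abstaining over guessing wrong.

```python
from collections import Counter

# Hypothetical label strings for the three grades named in the abstract.
GRADES = ("correct", "incorrect", "not_attempted")


def summarize_grades(grades):
    """Return per-grade rates and accuracy over attempted questions.

    `grades` is an iterable of strings drawn from GRADES, one per question.
    """
    grades = list(grades)
    if not grades:
        raise ValueError("at least one graded answer is required")

    counts = Counter(grades)
    unknown = set(counts) - set(GRADES)
    if unknown:
        raise ValueError(f"unknown grade labels: {sorted(unknown)}")

    total = len(grades)
    # A question counts as attempted if it was graded correct or incorrect.
    attempted = counts["correct"] + counts["incorrect"]

    return {
        "correct": counts["correct"] / total,
        "incorrect": counts["incorrect"] / total,
        "not_attempted": counts["not_attempted"] / total,
        # Defined as 0.0 when the model declined every question (an assumption).
        "correct_given_attempted": (
            counts["correct"] / attempted if attempted else 0.0
        ),
    }


if __name__ == "__main__":
    example = ["correct", "correct", "incorrect", "not_attempted"]
    for name, value in summarize_grades(example).items():
        print(f"{name}: {value:.3f}")
```

Under this sketch, a model that abstains on questions it would otherwise get wrong keeps the same `correct` rate but raises `correct_given_attempted`, which is the behavior the abstract describes as ideal.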
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Reasoning Token Induction | Mixed Prompts (SimpleQA, SimpleBench, AIME2024, etc.) (test) | Mean Completion Tokens | 4.46e+3 | 31 |
| Ranking Stability Analysis | SimpleQA and 4 Hallucination Benchmarks | Kendall's W | 0.9 | 28 |
| Ranking Consistency Analysis | MMLU-Pro health Virology | Spearman Correlation | 0.35 | 8 |
| Ranking Consistency Analysis | MMLU-Pro Medical genetics health | Spearman Correlation | 0.00 | 8 |
| Physical chemistry | ChemBench Physical Chemistry | Spearman Correlation | -0.42 | 8 |
| Technical chemistry | ChemBench Technical Chemistry | Spearman Correlation | -0.31 | 8 |
| Ranking Consistency Analysis | MMLU-Pro Nutrition health | Spearman Correlation | 0.11 | 8 |
| Analytical chemistry | ChemBench Analytical Chemistry | Spearman Correlation | -0.67 | 8 |
| Inorganic chemistry | ChemBench Inorganic Chemistry | Spearman Correlation | -0.69 | 8 |
| Material science | ChemBench Material science | Spearman Correlation | -0.78 | 8 |