
Structured Prompts Improve Evaluation of Language Models

About

As language models (LMs) are increasingly adopted across domains, high-quality benchmarking frameworks are essential for guiding deployment decisions. In practice, however, frameworks such as Holistic Evaluation of Language Models (HELM) typically evaluate models under a single static prompt configuration, even though model behavior depends strongly on prompt choice. As a result, reported scores can reflect prompt choice as much as model capability. Declarative prompting frameworks such as DSPy offer a scalable way to evaluate models under a set of structured prompting strategies rather than a single static prompt. We present a reproducible DSPy+HELM framework for studying how prompt choice impacts reported benchmark outcomes. Using five prompting methods, we evaluate four frontier and two open-source LMs across seven benchmarks, comparing against existing HELM baseline scores. By evaluating LMs across a family of prompt configurations, we find that prompt choice can materially impact leaderboard outcomes. In particular, structured prompting improves performance (by 6% on average) and alters model comparisons (leaderboard rankings shift on 5/7 benchmarks), with most gains coming from introducing chain-of-thought and little additional benefit from more advanced optimizers. To our knowledge, this is the first study to systematically integrate structured prompting into an established evaluation framework and quantify how prompt choice alone can impact benchmark conclusions. We open-source (i) the DSPy+HELM evaluation (https://github.com/stanford-crfm/helm/pull/3893) and (ii) the prompt optimization pipeline (https://github.com/StanfordMIMI/dspy-helm).
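To make the setup concrete, here is a minimal sketch of the core idea, assuming current DSPy conventions; the model name and question below are illustrative placeholders, not choices from the paper:

```python
import dspy

# Illustrative model choice, not necessarily one of the paper's evaluated LMs.
lm = dspy.LM("openai/gpt-4o-mini")
dspy.configure(lm=lm)

# Static baseline: the task is declared once as a signature and answered
# directly, analogous to a single fixed HELM prompt configuration.
direct = dspy.Predict("question -> answer")

# Structured strategy: the same signature, but DSPy inserts an intermediate
# chain-of-thought reasoning step before the final answer.
cot = dspy.ChainOfThought("question -> answer")

# Further strategies can come from DSPy optimizers (e.g., dspy.BootstrapFewShot),
# which compile few-shot demonstrations; the paper reports little added benefit
# from these over chain-of-thought, so they are omitted here.

question = "A drug is dosed at 5 mg/kg. What is the total dose for a 70 kg patient?"
print(direct(question=question).answer)
print(cot(question=question).answer)
```

Because the task is a declared signature rather than a hand-written prompt string, swapping prompting strategies means swapping one module while the benchmark harness stays fixed, which is what makes sweeping a family of prompt configurations across many benchmarks scalable.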

Asad Aali, Muhammad Ahmed Mohsin, Vasiliki Bikia, Arnav Singhvi, Richard Gaus, Suhana Bedi, Hejie Cui, Miguel Fuentes, Alyssa Unell, Yifan Mai, Jordan Cahoon, Michael Pfeffer, Roxana Daneshjou, Sanmi Koyejo, Emily Alsentzer, Christopher Potts, Nigam H. Shah, Akshay S. Chaudhari • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|------|---------|--------|--------|------|
| Graduate-level Question Answering | GPQA | Accuracy | 68.4 | 184 |
| Medical Question Answering | Medbullets | Accuracy | 82.5 | 65 |
| Massive Multitask Language Understanding | MMLU-Pro | Accuracy (MMLU-Pro) | 80.6 | 38 |
| Medical Question Answering | Medec | Accuracy | 69.2 | 30 |
| Language Modeling | HELM macro-averaged (test) | Accuracy | 73.1 | 30 |
| Medical Question Answering | MedCalc-Bench | Accuracy | 34.7 | 30 |
| Medical Question Answering | HeadQA | Accuracy | 92.2 | 30 |