Embedding by Elicitation: Dynamic Representations for Bayesian Optimization of System Prompts

About

System prompts are a central control mechanism in modern AI systems, shaping behavior across conversations, tasks, and user populations. Yet they are difficult to tune when feedback is available only as aggregate metrics rather than per-example labels, failures, or critiques. We study this aggregate feedback setting as sample-constrained black-box optimization over discrete, variable-length text. We introduce ReElicit, a Bayesian optimization framework based on embedding by elicitation. Given a task description, previously evaluated prompts, and their scalar scores, an LLM elicits a compact, interpretable feature space and maps prompts into it. An acquisition function over a Gaussian process surrogate then selects target feature vectors, which the LLM realizes and refines into deployable system prompts. Re-eliciting the feature space as new evaluations arrive lets the representation adapt to the observed prompt-score history. We evaluate the setting using offline benchmark accuracy as a controlled aggregate proxy: the optimizer observes one scalar score per prompt and no per-example labels, errors, or critiques. Across ten system prompt optimization tasks with a total budget of 30 evaluations, ReElicit achieves the strongest aggregate performance profile among representative aggregate-only prompt-optimization baselines. These results suggest that LLMs can serve as adaptive semantic representation builders, not only prompt generators, for Bayesian optimization over natural-language artifacts.
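
The selection step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the RBF kernel, the expected-improvement acquisition, the fixed candidate set, the assumption that features lie in [0, 1], and all function names (`gp_posterior`, `expected_improvement`, `propose_target`) are assumptions chosen for illustration. The LLM steps (eliciting features, embedding prompts, realizing a target vector as a prompt) are left out. The sketch only shows how a Gaussian process surrogate over elicited feature vectors and an acquisition function could pick the next target feature vector.

```python
import math
import numpy as np


def rbf_kernel(A, B, lengthscale=1.0):
    # Squared-exponential kernel between the rows of A and the rows of B.
    sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * sq_dist / lengthscale**2)


def gp_posterior(X, y, Xq, noise=1e-4):
    # GP regression with a constant mean (the average observed score)
    # and unit prior variance. Returns posterior mean and std at Xq.
    y_mean = y.mean()
    K = rbf_kernel(X, X) + noise * np.eye(len(X))
    Ks = rbf_kernel(X, Xq)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y - y_mean))
    mu = Ks.T @ alpha + y_mean
    v = np.linalg.solve(L, Ks)
    var = np.clip(1.0 - (v**2).sum(axis=0), 1e-12, None)
    return mu, np.sqrt(var)


def expected_improvement(mu, sigma, best):
    # Expected improvement for maximization (assumed acquisition function).
    z = (mu - best) / sigma
    cdf = 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))
    pdf = np.exp(-0.5 * z**2) / math.sqrt(2.0 * math.pi)
    return np.maximum((mu - best) * cdf + sigma * pdf, 0.0)


def propose_target(X, y, candidates):
    # Pick the candidate feature vector with the highest expected improvement.
    mu, sigma = gp_posterior(X, y, candidates)
    ei = expected_improvement(mu, sigma, y.max())
    return candidates[int(np.argmax(ei))], ei


if __name__ == "__main__":
    # Hypothetical 2-feature space, e.g. ("asks for step-by-step reasoning",
    # "specifies output format"); one row per already-evaluated prompt.
    X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
    y = np.array([0.2, 0.5, 0.4])  # one aggregate score per prompt
    rng = np.random.default_rng(0)
    candidates = rng.uniform(0.0, 1.0, size=(64, 2))
    target, _ = propose_target(X, y, candidates)
    print("next target feature vector:", target)
```

In a full loop, the chosen vector would be handed back to the LLM to write a prompt, the prompt would be scored, and the feature space could be re-elicited before the next round.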

Zhiyuan Jerry Lin, Benjamin Letham, Samuel Dooley, Maximilian Balandat, Eytan Bakshy • 2026

Related benchmarks

Task	Dataset	Result
Multi-task Language Understanding	MMLU	MMLU Accuracy 57.1	456
Logic reasoning	Tracking Shuffled Objects BBH	Accuracy 20.4	59
Causal Reasoning	BBH Causal Judgement	Accuracy (BBH Causal Judgement) 58.4	40
Logical reasoning	BigBench Hard Boolean Expressions	Accuracy 76.8	17
Linguistic Reasoning	BigBench Hard Disambiguation QA	Accuracy 55.1	5
Prompt Optimization	10-task prompt optimization suite GSM8K MMLU BBH	Average Win/Tie Rate 81	5
Reasoning	BigBench Hard Penguins	Accuracy 44.1	5
Linguistic Reasoning	BigBench Hard Snarks	Accuracy 0.551	5
Linguistic Reasoning	BigBench Hard Hyperbaton	Accuracy 79.6	5
Logical reasoning	BigBench Hard Formal Fallacies	Accuracy 58.1	5
