Embedding by Elicitation: Dynamic Representations for Bayesian Optimization of System Prompts

About

System prompts are a central control mechanism in modern AI systems, shaping behavior across conversations, tasks, and user populations. Yet they are difficult to tune when feedback is available only as aggregate metrics rather than per-example labels, failures, or critiques. We study this aggregate feedback setting as sample-constrained black-box optimization over discrete, variable-length text. We introduce ReElicit, a Bayesian optimization framework based on embedding by elicitation. Given a task description, previously evaluated prompts, and scalar scores, an LLM elicits a compact, interpretable feature space and maps prompts into it. A Gaussian process surrogate is fit in this feature space, and an acquisition function then selects target feature vectors, which the LLM realizes and refines into deployable system prompts. Re-eliciting the feature space as new evaluations arrive lets the representation adapt to the observed prompt-score history. We evaluate the setting using offline benchmark accuracy as a controlled aggregate proxy: the optimizer observes one scalar score per prompt and no per-example labels, errors, or critiques. Across ten system prompt optimization tasks with a total budget of 30 evaluations, ReElicit achieves the strongest aggregate performance profile among representative aggregate-only prompt-optimization baselines. These results suggest that LLMs can serve as adaptive semantic representation builders, not only prompt generators, for Bayesian optimization over natural-language artifacts.

Zhiyuan Jerry Lin, Benjamin Letham, Samuel Dooley, Maximilian Balandat, Eytan Bakshy • 2026
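
The surrogate-and-acquisition step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not name a kernel or an acquisition function, so the RBF kernel, the noise level, and expected improvement used here are assumptions. The LLM steps (eliciting features, embedding prompts, realizing a target vector as a prompt) are omitted, and the hand-written feature vectors and scores below are hypothetical stand-ins for their output.

```python
import math

def rbf(a, b, length_scale=1.0):
    """Squared-exponential kernel between two feature vectors (assumed kernel)."""
    sq = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-0.5 * sq / length_scale ** 2)

def cholesky(A):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def solve_lower(L, b):
    """Solve L z = b by forward substitution."""
    z = []
    for i in range(len(b)):
        z.append((b[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i])
    return z

def solve_upper(L, z):
    """Solve L^T x = z by back substitution."""
    n = len(z)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (z[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

def gp_posterior(X, y, x_star, noise=1e-4):
    """Posterior mean and variance at x_star for a GP fit to mean-centered scores."""
    mean_y = sum(y) / len(y)
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(X)]
         for i, a in enumerate(X)]
    L = cholesky(K)
    alpha = solve_upper(L, solve_lower(L, [yi - mean_y for yi in y]))
    k_star = [rbf(a, x_star) for a in X]
    mu = mean_y + sum(k * a for k, a in zip(k_star, alpha))
    v = solve_lower(L, k_star)
    var = max(rbf(x_star, x_star) - sum(vi * vi for vi in v), 1e-12)
    return mu, var

def expected_improvement(mu, var, best):
    """Expected improvement over the best observed score (maximization)."""
    sd = math.sqrt(var)
    z = (mu - best) / sd
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
    cdf = 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return (mu - best) * cdf + sd * pdf

def select_target(X, y, candidates):
    """Pick the candidate feature vector with the highest expected improvement."""
    best = max(y)
    return max(candidates,
               key=lambda c: expected_improvement(*gp_posterior(X, y, c), best))

# Hypothetical history: 2-D elicited features and one aggregate score per prompt.
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
y = [0.40, 0.60, 0.45]
candidates = [[0.5, 0.5], [1.0, 0.5], [1.0, 1.0]]
target = select_target(X, y, candidates)
print(target)
```

In the framework the abstract describes, the selected target vector would be handed back to the LLM to be realized as a concrete system prompt, and the feature space itself would be re-elicited as new scores arrive; neither step is modeled in this sketch.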

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Multi-task Language Understanding | MMLU | MMLU Accuracy | 57.1 | 442 |
| Logic reasoning | Tracking Shuffled Objects BBH | Accuracy | 20.4 | 59 |
| Causal Reasoning | BBH Causal Judgement | Accuracy (BBH Causal Judgement) | 58.4 | 40 |
| Logical reasoning | BigBench Hard Boolean Expressions | Accuracy | 76.8 | 17 |
| Linguistic Reasoning | BigBench Hard Disambiguation QA | Accuracy | 55.1 | 5 |
| Prompt Optimization | 10-task prompt optimization suite GSM8K MMLU BBH | Average Win/Tie Rate | 81 | 5 |
| Reasoning | BigBench Hard Penguins | Accuracy | 44.1 | 5 |
| Linguistic Reasoning | BigBench Hard Snarks | Accuracy | 0.551 | 5 |
| Linguistic Reasoning | BigBench Hard Hyperbaton | Accuracy | 79.6 | 5 |
| Logical reasoning | BigBench Hard Formal Fallacies | Accuracy | 58.1 | 5 |