A Unified Evaluation-Instructed Framework for Query-Dependent Prompt Optimization

About

Most prompt-optimization methods refine a single static template, making them ineffective in complex and dynamic user scenarios. Existing query-dependent approaches rely on unstable textual feedback or black-box reward models, providing weak and uninterpretable optimization signals. More fundamentally, prompt quality itself lacks a unified, systematic definition, resulting in fragmented and unreliable evaluation signals. Our approach first establishes a performance-oriented, systematic, and comprehensive prompt evaluation framework. We then develop and finetune an execution-free evaluator that predicts multi-dimensional quality scores directly from text. The evaluator in turn instructs a metric-aware optimizer that diagnoses failure modes and rewrites prompts in an interpretable, query-dependent manner. Our evaluator achieves the strongest accuracy in predicting prompt performance, and the evaluation-instructed optimization consistently surpasses both static-template and query-dependent baselines across eight datasets and three backbone models. Overall, we propose a unified, metric-grounded perspective on prompt quality and demonstrate that our evaluation-instructed optimization pipeline delivers stable, interpretable, and model-agnostic improvements across diverse tasks.
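
The evaluate-then-rewrite loop described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not list the quality dimensions or the rewriting procedure, so the dimension names (`has_reasoning_cue`, `has_output_format`), the keyword-based `toy_evaluator`, the `REWRITES` table, and the `threshold` and `max_rounds` parameters are all hypothetical stand-ins. In the paper the evaluator is a finetuned model; here it is a simple heuristic, used only to show how per-dimension scores could steer a targeted rewrite without executing the prompt.

```python
from typing import Callable, Dict

Scores = Dict[str, float]


def toy_evaluator(prompt: str) -> Scores:
    """Hypothetical stand-in for the finetuned evaluator.

    It scores the prompt text alone (execution-free) on two made-up
    dimensions; the paper's actual metrics are not given in the abstract.
    """
    text = prompt.lower()
    return {
        "has_reasoning_cue": 1.0 if "step by step" in text else 0.0,
        "has_output_format": 1.0 if "final answer" in text else 0.0,
    }


# One hypothetical rewrite rule per dimension, applied when that dimension is weak.
REWRITES: Dict[str, Callable[[str], str]] = {
    "has_reasoning_cue": lambda p: p + " Think step by step.",
    "has_output_format": lambda p: p + " End with 'Final answer: <answer>'.",
}


def optimize(
    prompt: str,
    evaluator: Callable[[str], Scores] = toy_evaluator,
    threshold: float = 0.5,
    max_rounds: int = 5,
) -> str:
    """Repeatedly repair the weakest-scoring dimension of a prompt."""
    for _ in range(max_rounds):
        scores = evaluator(prompt)
        # Diagnose: the lowest-scoring dimension is treated as the failure mode.
        weakest = min(scores, key=scores.get)
        if scores[weakest] >= threshold:
            break  # every dimension already meets the threshold
        # Rewrite: apply only the rule tied to the diagnosed dimension.
        prompt = REWRITES[weakest](prompt)
    return prompt


if __name__ == "__main__":
    print(optimize("What is 17 * 23?"))
```

Because each rewrite is tied to a named dimension, the change made to a given query's prompt can be traced back to the score that triggered it, which is the sense in which such a loop is interpretable and query-dependent.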

Ke Chen, Yifeng Wang, Hassan Almosapeeh, Haohan Wang • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | MATH 500 | Accuracy | 86 | 221 |
| Medical Question Answering | MedQA | Accuracy | 57 | 154 |
| Causal Reasoning | BBH Causal Judgement | Accuracy (BBH Causal Judgement) | 78 | 40 |
| Science Question Answering | GPQA Diamond | Accuracy | 29 | 31 |
| Common Sense Reasoning | BBH Sports Understanding | Accuracy (BBH Sports) | 83 | 21 |
| Legal Reasoning | LegalBench | Accuracy | 90 | 18 |
| Logical Reasoning | BBH Web of Lies | Accuracy | 98 | 18 |
| Question Answering | BBH Disambiguation QA | Accuracy (BBH Disambiguation QA) | 69 | 18 |