A Unified Evaluation-Instructed Framework for Query-Dependent Prompt Optimization
About
Most prompt-optimization methods refine a single static template, making them ineffective in complex and dynamic user scenarios. Existing query-dependent approaches rely on unstable textual feedback or black-box reward models, providing weak and uninterpretable optimization signals. More fundamentally, prompt quality itself lacks a unified, systematic definition, resulting in fragmented and unreliable evaluation signals. Our approach first establishes a performance-oriented, systematic, and comprehensive prompt evaluation framework. We then develop and finetune an execution-free evaluator that predicts multi-dimensional quality scores directly from text. The evaluator in turn instructs a metric-aware optimizer that diagnoses failure modes and rewrites prompts in an interpretable, query-dependent manner. Our evaluator achieves the strongest accuracy in predicting prompt performance, and the evaluation-instructed optimization consistently surpasses both static-template and query-dependent baselines across eight datasets and three backbone models. Overall, we propose a unified, metric-grounded perspective on prompt quality and demonstrate that our evaluation-instructed optimization pipeline delivers stable, interpretable, and model-agnostic improvements across diverse tasks.
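The evaluate, diagnose, rewrite loop described above can be illustrated with a minimal sketch. Everything in it is a hypothetical stand-in rather than the paper's implementation: the metric names (`clarity`, `reasoning_guidance`, `format_spec`), the keyword-based `toy_evaluator` (the paper uses a finetuned model), the fixed rewrite rules, and the `threshold` and `max_rounds` parameters are all assumptions chosen only to make the control flow concrete.

```python
from typing import Callable, Dict, List

Scores = Dict[str, float]


def toy_evaluator(prompt: str) -> Scores:
    """Hypothetical execution-free evaluator: scores the prompt text only,
    without running any downstream model. Metric names are illustrative."""
    text = prompt.lower()
    return {
        "clarity": 1.0 if len(prompt.split()) >= 8 else 0.3,
        "reasoning_guidance": 1.0 if "step by step" in text else 0.2,
        "format_spec": 1.0 if "answer:" in text else 0.4,
    }


# Hypothetical metric-aware rewrite rules: one targeted fix per failing metric.
# The clarity rule uses the query, which makes the rewrite query-dependent.
REWRITES: Dict[str, Callable[[str, str], str]] = {
    "clarity": lambda p, q: f"{p} The question is: {q}",
    "reasoning_guidance": lambda p, q: f"{p} Think step by step.",
    "format_spec": lambda p, q: f"{p} End with a line starting with 'Answer:'.",
}


def diagnose(scores: Scores, threshold: float = 0.5) -> List[str]:
    """Return metrics scoring below the threshold, worst first."""
    failing = [m for m, s in scores.items() if s < threshold]
    return sorted(failing, key=lambda m: scores[m])


def optimize(prompt: str, query: str,
             evaluator: Callable[[str], Scores] = toy_evaluator,
             max_rounds: int = 3) -> str:
    """Repeatedly fix the worst-scoring metric until none fail."""
    for _ in range(max_rounds):
        failures = diagnose(evaluator(prompt))
        if not failures:
            break
        prompt = REWRITES[failures[0]](prompt, query)
    return prompt


if __name__ == "__main__":
    improved = optimize("Solve it.", "What is 2+3?")
    print(improved)
    print(toy_evaluator(improved))
```

Because each rewrite is tied to a named metric, the sequence of applied fixes doubles as an explanation of why the prompt changed, which is the sense in which such a loop is interpretable.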
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | MATH 500 | Accuracy | 86 | 221 |
| Medical Question Answering | MedQA | Accuracy | 57 | 154 |
| Causal Reasoning | BBH Causal Judgement | Accuracy (BBH Causal Judgement) | 78 | 40 |
| Science Question Answering | GPQA Diamond | Accuracy | 29 | 31 |
| Common Sense Reasoning | BBH Sports Understanding | Accuracy (BBH Sports) | 83 | 21 |
| Legal Reasoning | LegalBench | Accuracy | 90 | 18 |
| Logical Reasoning | BBH Web of Lies | Accuracy | 98 | 18 |
| Question Answering | BBH Disambiguation QA | Accuracy (BBH Disambiguation QA) | 69 | 18 |