Error Taxonomy-Guided Prompt Optimization
About
Automatic Prompt Optimization (APO) is a powerful approach for extracting performance from large language models without modifying their weights. Many existing methods rely on trial-and-error, testing different prompts or in-context examples until a good configuration emerges, often consuming substantial compute. Recently, natural language feedback derived from execution logs has shown promise as a way to identify how prompts can be improved. However, most prior approaches operate in a bottom-up manner, iteratively adjusting the prompt based on feedback from individual problems, which can cause them to lose the global perspective. In this work, we propose Error Taxonomy-Guided Prompt Optimization (ETGPO), a prompt optimization algorithm that adopts a top-down approach. ETGPO focuses on the global failure landscape by collecting model errors, categorizing them into a taxonomy, and augmenting the prompt with guidance targeting the most frequent failure modes. Across multiple benchmarks spanning mathematics, question answering, and logical reasoning, ETGPO achieves accuracy that is comparable to or better than state-of-the-art methods, while requiring roughly one third of the optimization-phase token usage and evaluation budget.
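The top-down loop described above (collect errors, group them into a taxonomy, then add guidance for the most frequent categories) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how errors are categorized or how guidance is written, so `keyword_categorize`, `build_taxonomy`, `augment_prompt`, the `guidance` mapping, and the `top_k` parameter are all hypothetical names chosen for illustration, and a simple keyword rule stands in for whatever categorizer ETGPO actually uses.

```python
from collections import Counter
from typing import Callable


def keyword_categorize(error: str) -> str:
    """Hypothetical stand-in categorizer: maps an error description to a category."""
    text = error.lower()
    if "sign" in text:
        return "sign_error"
    if "misread" in text:
        return "misreading"
    return "other"


def build_taxonomy(errors: list[str], categorize: Callable[[str], str]) -> Counter:
    """Count how many collected errors fall into each category."""
    return Counter(categorize(e) for e in errors)


def augment_prompt(
    base_prompt: str,
    taxonomy: Counter,
    guidance: dict[str, str],
    top_k: int = 2,
) -> str:
    """Append guidance for the top_k most frequent error categories.

    Categories without a guidance entry are skipped; if nothing applies,
    the base prompt is returned unchanged.
    """
    tips = [guidance[cat] for cat, _ in taxonomy.most_common(top_k) if cat in guidance]
    if not tips:
        return base_prompt
    bullet_list = "\n".join(f"- {tip}" for tip in tips)
    return f"{base_prompt}\n\nCommon pitfalls to avoid:\n{bullet_list}"


# Illustrative data only; these are not errors or guidance from the paper.
errors = [
    "sign error in algebra step",
    "misread the question",
    "sign error when expanding",
    "arithmetic slip",
]
guidance = {
    "sign_error": "Re-check signs after each algebraic step.",
    "misreading": "Restate the question before solving.",
}
taxonomy = build_taxonomy(errors, keyword_categorize)
prompt = augment_prompt("Solve the problem.", taxonomy, guidance, top_k=1)
```

The point of the sketch is the ordering of work: the prompt is edited once, after all errors have been aggregated, rather than after each individual failure.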
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Logical reasoning | FOLIO (test) | Accuracy | 82.45 | 58 |
| Logical reasoning | AR-LSAT (test) | Accuracy | 91.44 | 24 |
| Multi-hop Reasoning | MuSiQue (test) | Mean Accuracy | 77.3 | 4 |
| General Question Answering | MMLU Pro (test) | Mean Accuracy | 79.4 | 4 |
| Math Reasoning | AIME 2025 (test) | Mean Accuracy | 49.06 | 4 |
| General | MMLU Pro (test) | Accuracy | 83.65 | 4 |
| General | MMLU Pro (test) | Optimization Token Usage (k) | 778 | 3 |
| General Question Answering | MMLU Pro (test) | Optimization Token Usage | 595 | 3 |
| Logical reasoning | FOLIO | Optimization-phase Token Usage | 453 | 3 |
| Math Reasoning | AIME (test) | Token Usage (Optimization Phase, Thousands) | 3.10e+3 | 3 |