Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution
About
Popular prompt strategies like Chain-of-Thought Prompting can dramatically improve the reasoning abilities of Large Language Models (LLMs) in various domains. However, such hand-crafted prompt strategies are often sub-optimal. In this paper, we present Promptbreeder, a general-purpose self-referential self-improvement mechanism that evolves and adapts prompts for a given domain. Driven by an LLM, Promptbreeder mutates a population of task-prompts and then evaluates them for fitness on a training set. Crucially, the mutation of these task-prompts is governed by mutation-prompts that the LLM generates and improves throughout evolution in a self-referential way. That is, Promptbreeder is not just improving task-prompts; it is also improving the mutation-prompts that improve these task-prompts. Promptbreeder outperforms state-of-the-art prompt strategies such as Chain-of-Thought and Plan-and-Solve Prompting on commonly used arithmetic and commonsense reasoning benchmarks. Furthermore, Promptbreeder is able to evolve intricate task-prompts for the challenging problem of hate speech classification.
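The evolutionary loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `llm` function is a stub standing in for a real model call, and `fitness` uses prompt length as a toy proxy for accuracy on the training set. It shows the key structural idea, though: each unit of evolution pairs a task-prompt with a mutation-prompt, and the mutation-prompt itself is occasionally rewritten by the LLM (the self-referential step).

```python
import random

random.seed(0)  # deterministic for illustration

def llm(prompt: str) -> str:
    # Stub standing in for an LLM call; a real system would sample a
    # model and return its completion. Here we just append a suffix.
    return prompt.split("INSTRUCTION:")[-1].strip() + " Think step by step."

def fitness(task_prompt: str, train_set) -> float:
    # Toy proxy: in Promptbreeder this would be accuracy of the
    # task-prompt on a batch of training examples.
    return float(len(task_prompt))

def mutate_task(task_prompt: str, mutation_prompt: str) -> str:
    # A mutation-prompt conditions the LLM to rewrite a task-prompt.
    return llm(f"{mutation_prompt}\nINSTRUCTION: {task_prompt}")

def mutate_mutation(mutation_prompt: str) -> str:
    # Self-referential step: the mutation-prompt is itself improved.
    return llm(f"Improve this mutation instruction.\nINSTRUCTION: {mutation_prompt}")

def promptbreeder(pop, train_set, generations=20, hyper_rate=0.3):
    # pop: list of (task_prompt, mutation_prompt) units.
    for _ in range(generations):
        # Binary tournament: the fitter unit's mutant replaces the loser.
        a, b = random.sample(range(len(pop)), 2)
        if fitness(pop[a][0], train_set) < fitness(pop[b][0], train_set):
            a, b = b, a
        task, mut = pop[a]
        if random.random() < hyper_rate:
            mut = mutate_mutation(mut)  # evolve the mutation-prompt too
        pop[b] = (mutate_task(task, mut), mut)
    return max(pop, key=lambda u: fitness(u[0], train_set))

pop = [("Solve the problem.", "Rephrase the instruction to be clearer.")] * 4
best_task, best_mut = promptbreeder(list(pop), train_set=[])
```

Under the toy fitness function the evolved task-prompt simply grows, but the same loop structure applies when `fitness` scores real benchmark accuracy.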
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Question Answering | GPQA (test) | Accuracy | 40.9 | 55 |
| Structured JSON Generation | MultiWOZ, Super-NaturalInstructions, TruthfulQA, and Self-Instruct (averaged) | Similarity Score | 0.8 | 16 |
| Navigation Reasoning | BBH-Navigate (test) | Accuracy | 96.3 | 11 |
| Mathematical Reasoning | AGIEval-MATH (test) | Accuracy | 45.9 | 11 |
| Fact Checking | LIAR (test) | Accuracy | 63.2 | 11 |
| Coreference Resolution | WSC (test) | Accuracy | 76.7 | 11 |
| Prompt Optimization | DABench | Accuracy (Easy) | 80 | 10 |
| Prompt Optimization | VisEval | Accuracy (Easy) | 0.76 | 10 |
| Mathematical Reasoning | MATH Levels 3, 4, 5 (test) | Accuracy (Level 3) | 92 | 10 |