Self-Discover: Large Language Models Self-Compose Reasoning Structures
About
We introduce SELF-DISCOVER, a general framework for LLMs to self-discover the task-intrinsic reasoning structures needed to tackle complex reasoning problems that are challenging for typical prompting methods. Core to the framework is a self-discovery process in which LLMs select multiple atomic reasoning modules, such as critical thinking and step-by-step thinking, and compose them into an explicit reasoning structure for LLMs to follow during decoding. SELF-DISCOVER substantially improves GPT-4's and PaLM 2's performance on challenging reasoning benchmarks such as BigBench-Hard, grounded agent reasoning, and MATH, by as much as 32% compared to Chain of Thought (CoT). Furthermore, SELF-DISCOVER outperforms inference-intensive methods such as CoT with Self-Consistency by more than 20%, while requiring 10-40x less inference compute. Finally, we show that the self-discovered reasoning structures are universally applicable across model families: from PaLM 2-L to GPT-4, and from GPT-4 to Llama2, and share commonalities with human reasoning patterns.
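The self-discovery process described above can be sketched as a three-stage prompting pipeline: select relevant reasoning modules, adapt them to the task, and implement them as an explicit structure the model then follows at inference time. The sketch below is illustrative, not the paper's implementation; the `llm` function is a hypothetical stand-in for a real model API, and the exact prompt wording is an assumption.

```python
# Hedged sketch of SELF-DISCOVER's three-stage self-discovery process.
# `llm` is a hypothetical placeholder for a real model call (e.g. GPT-4);
# the meta-prompt wording here is illustrative, not the paper's exact text.

# A small sample of atomic reasoning modules (the paper uses a larger set).
REASONING_MODULES = [
    "Critical thinking: analyze the problem from different perspectives.",
    "Step-by-step thinking: break the problem into ordered sub-steps.",
    "Simplification: restate the problem in a simpler, equivalent form.",
]

def llm(prompt: str) -> str:
    """Hypothetical model call; replace with a real API client."""
    return f"[model output for: {prompt[:40]}...]"

def self_discover(task_examples: list[str]) -> str:
    """Compose a task-specific reasoning structure in three stages."""
    examples = "\n".join(task_examples)
    modules = "\n".join(REASONING_MODULES)
    # Stage 1 (SELECT): pick the modules relevant to this task.
    selected = llm(
        f"Select the reasoning modules relevant to these tasks:\n"
        f"{examples}\nModules:\n{modules}"
    )
    # Stage 2 (ADAPT): rephrase the selected modules to be task-specific.
    adapted = llm(f"Adapt these modules to the task:\n{selected}\n{examples}")
    # Stage 3 (IMPLEMENT): turn them into an explicit reasoning structure.
    return llm(
        f"Implement the adapted modules as a step-by-step reasoning "
        f"structure:\n{adapted}"
    )

def solve(task_instance: str, structure: str) -> str:
    """At decoding time, the model follows the discovered structure."""
    return llm(
        f"Follow this reasoning structure to solve the task:\n"
        f"{structure}\nTask: {task_instance}"
    )
```

Note that the structure is discovered once per task, then reused for every instance of that task, which is why the method needs far less compute than per-instance ensembling approaches like Self-Consistency.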
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | SVAMP | Accuracy | 17.33 | 368 |
| Commonsense Reasoning | CSQA | Accuracy | 57.33 | 366 |
| Mathematical Reasoning | GSM8K | Accuracy (GSM8K) | 7.43 | 358 |
| Math Reasoning | GSM8K (test) | Accuracy | 56.33 | 155 |
| Logic reasoning | Tracking Shuffled Objects BBH | Accuracy | 60.03 | 54 |
| Commonsense Reasoning | MMLU | Accuracy | 52.63 | 37 |
| Logic reasoning | Causal Judgement | Accuracy | 36 | 30 |
| Reasoning | BIG-Bench Hard (BBH) (test) | Average Accuracy | 56.9 | 28 |
| Knowledge and Commonsense Reasoning | MMLU | Accuracy | 58.07 | 23 |
| Logical reasoning | T-Obj. | Accuracy | 2.4 | 23 |