STaR: Bootstrapping Reasoning With Reasoning
About
Generating step-by-step "chain-of-thought" rationales improves language model performance on complex reasoning tasks like mathematics or commonsense question-answering. However, inducing language model rationale generation currently requires either constructing massive rationale datasets or sacrificing accuracy by using only few-shot inference. We propose a technique to iteratively leverage a small number of rationale examples and a large dataset without rationales, to bootstrap the ability to perform successively more complex reasoning. This technique, the "Self-Taught Reasoner" (STaR), relies on a simple loop: generate rationales to answer many questions, prompted with a few rationale examples; if the generated answers are wrong, try again to generate a rationale given the correct answer; fine-tune on all the rationales that ultimately yielded correct answers; repeat. We show that STaR significantly improves performance on multiple datasets compared to a model fine-tuned to directly predict final answers, and performs comparably to fine-tuning a 30$\times$ larger state-of-the-art language model on CommonsenseQA. Thus, STaR lets a model improve itself by learning from its own generated reasoning.
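The loop described above can be sketched in a few lines. This is a minimal toy illustration, not the paper's implementation: the `generate`, `rationalize`, and `fine_tune` helpers are hypothetical stand-ins (here the "model" is a lookup table that memorizes rationales as "fine-tuning" data accumulates), and real STaR uses a large language model for each step.

```python
def generate(model, question):
    """Few-shot rationale generation: return (rationale, predicted answer).

    Toy stand-in: look the question up in the 'model'; unseen questions fail.
    """
    return model.get(question, ("", None))


def rationalize(question, correct_answer):
    """Rationalization step: retry with the correct answer given as a hint."""
    return (f"Reasoning toward {correct_answer} for '{question}'.", correct_answer)


def fine_tune(model, examples):
    """Toy 'fine-tuning': memorize rationales that reached correct answers."""
    updated = dict(model)
    for question, rationale, answer in examples:
        updated[question] = (rationale, answer)
    return updated


def star(model, dataset, iterations=3):
    """One outer STaR loop over a dataset of (question, answer) pairs."""
    for _ in range(iterations):
        correct = []
        for question, answer in dataset:
            rationale, predicted = generate(model, question)
            if predicted != answer:
                # Wrong answer: rationalize, conditioning on the correct answer.
                rationale, predicted = rationalize(question, answer)
            if predicted == answer:
                correct.append((question, rationale, answer))
        # Fine-tune on all rationales that ultimately yielded correct answers,
        # then repeat with the improved model.
        model = fine_tune(model, correct)
    return model
```

Note that in the paper each iteration fine-tunes from the original pre-trained model on the accumulated rationale dataset, rather than continuing from the previous iteration's weights; the sketch above simplifies that detail.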
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Mathematical Reasoning | GSM8K (test) | Accuracy: 75.51 | 797 |
| Mathematical Reasoning | GSM8K (test) | Accuracy: 10.7 | 751 |
| Mathematical Reasoning | MATH (test) | Overall Accuracy: 29.47 | 433 |
| Mathematical Reasoning | MATH500 (test) | -- | 381 |
| Logical Reasoning | LogiQA (test) | Accuracy: 35.94 | 92 |
| Logical Reasoning | ReClor (test) | Accuracy: 46.4 | 87 |
| Mathematical Reasoning | GSM8K (val) | -- | 67 |
| Science Reasoning | GPQA (test) | Accuracy: 22.99 | 41 |
| Reasoning | BIG-Bench Hard (BBH) (test) | -- | 28 |
| Multi-hop Question Answering | StrategyQA (test) | Accuracy: 33.36 | 26 |