Symbolic Chain-of-Thought Distillation: Small Models Can Also "Think" Step-by-Step
About
Chain-of-thought prompting (e.g., "Let's think step-by-step") primes large language models to verbalize rationalizations for their predictions. While chain-of-thought can lead to dramatic performance gains, benefits appear to emerge only for sufficiently large models (beyond 50B parameters). We show that orders-of-magnitude smaller models (125M to 1.3B parameters) can still benefit from chain-of-thought prompting. To achieve this, we introduce Symbolic Chain-of-Thought Distillation (SCoTD), a method to train a smaller student model on rationalizations sampled from a significantly larger teacher model. Experiments across several commonsense benchmarks show that: 1) SCoTD enhances the performance of the student model in both supervised and few-shot settings, and especially for challenge sets; 2) sampling many reasoning chains per instance from the teacher is paramount; and 3) after distillation, student chain-of-thoughts are judged by humans as comparable to the teacher, despite orders of magnitude fewer parameters. We test several hypotheses regarding what properties of chain-of-thought samples are important, e.g., diversity vs. teacher likelihood vs. open-endedness. We release our corpus of chain-of-thought samples and code.
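The data-construction step described in the abstract (sampling many reasoning chains per instance from a teacher, then using them as training targets for a student) might be sketched roughly as follows. This is a minimal illustration, not the authors' released code: `build_scotd_corpus`, `toy_teacher`, the prompt/target format, and the value of `chains_per_instance` are all hypothetical, and the teacher here is a stub standing in for a large language model.

```python
import random
from typing import Callable, Dict, List, Tuple

# Hypothetical teacher signature: takes a question and an RNG, returns one
# sampled (rationale, answer) pair. A real teacher would be a large LM.
TeacherFn = Callable[[str, random.Random], Tuple[str, str]]


def build_scotd_corpus(
    questions: List[str],
    sample_teacher: TeacherFn,
    chains_per_instance: int = 4,  # arbitrary illustrative value
    seed: int = 0,
) -> List[Dict[str, str]]:
    """Collect several teacher rationales per question as student training pairs."""
    rng = random.Random(seed)
    corpus = []
    for question in questions:
        for _ in range(chains_per_instance):
            rationale, answer = sample_teacher(question, rng)
            corpus.append(
                {
                    # Assumed format: the student reads the question and is
                    # trained to emit the rationale followed by the answer.
                    "prompt": f"Q: {question}\nA:",
                    "target": f" {rationale} So the answer is {answer}.",
                }
            )
    return corpus


def toy_teacher(question: str, rng: random.Random) -> Tuple[str, str]:
    """Stub teacher that picks a canned rationale at random (illustration only)."""
    rationales = [
        "Birds usually have wings, and wings are used for flying.",
        "Most birds can fly, although a few species cannot.",
        "Flying is a typical ability of birds.",
    ]
    return rng.choice(rationales), "yes"


corpus = build_scotd_corpus(
    ["Can a sparrow fly?", "Can a robin fly?"],
    toy_teacher,
    chains_per_instance=3,
)
print(len(corpus))
print(corpus[0]["prompt"] + corpus[0]["target"])
```

The resulting prompt/target pairs would then be fed to an ordinary language-modeling fine-tuning loop for the student; that step is omitted here because it depends on the training framework.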
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | GSM8K | Accuracy | 86.2 | 499 |
| Mathematical Reasoning | MATH | Accuracy | 74.88 | 338 |
| Common Sense Reasoning | BoolQ | Accuracy | 74.62 | 212 |
| Mathematical Reasoning | TabMWP | Accuracy | 94.17 | 188 |
| Commonsense Reasoning | CSQA | CSQA Accuracy | 73.12 | 126 |
| Reasoning | OpenBookQA | Accuracy | 79.8 | 77 |
| Natural Language Inference | aNLI | Accuracy | 62.73 | 65 |
| Question Answering | ARC-C | Accuracy | 87.26 | 54 |
| Question Answering | SQA | Accuracy | 73.51 | 24 |
| Reasoning | Date | Accuracy on Date | 75.1 | 24 |