Faithful Chain-of-Thought Reasoning
About
While Chain-of-Thought (CoT) prompting boosts Language Models' (LMs') performance on a gamut of complex reasoning tasks, the generated reasoning chain does not necessarily reflect how the model arrives at the answer (i.e., it may not be *faithful*). We propose Faithful CoT, a reasoning framework involving two stages: Translation (natural language query $\rightarrow$ symbolic reasoning chain) and Problem Solving (reasoning chain $\rightarrow$ answer), using an LM and a deterministic solver respectively. This guarantees that the reasoning chain provides a faithful explanation of the final answer. Aside from interpretability, Faithful CoT also improves empirical performance: it outperforms standard CoT on 9 of 10 benchmarks from 4 diverse domains, with a relative accuracy gain of 6.3% on Math Word Problems (MWP), 3.4% on Planning, 5.5% on Multi-hop Question Answering (QA), and 21.4% on Relational Inference. Furthermore, with GPT-4 and Codex, it sets the new state-of-the-art few-shot performance on 7 datasets (with 95.0+ accuracy on 6 of them), showing a strong synergy between faithfulness and accuracy.
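The two-stage pipeline can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `translate` step is stubbed with a canned reasoning chain that an LM might produce for a math word problem (in practice it would be a few-shot LM call), while the `solve` step runs the chain with a deterministic interpreter, so the answer follows mechanically from the chain.

```python
def translate(query: str) -> str:
    """Stage 1 (Translation): natural language query -> symbolic reasoning chain.
    Stubbed here with a hand-written chain; a real system would prompt an LM
    with few-shot (query, chain) exemplars."""
    # Canned chain for: "Tom has 3 apples and buys 4 more. How many does he have?"
    return (
        "# apples Tom starts with\n"
        "start = 3\n"
        "# apples Tom buys\n"
        "bought = 4\n"
        "answer = start + bought\n"
    )


def solve(chain: str):
    """Stage 2 (Problem Solving): execute the reasoning chain with a
    deterministic solver (here, the Python interpreter itself), so the
    chain is a faithful explanation of the returned answer."""
    env: dict = {}
    exec(chain, {}, env)
    return env["answer"]


query = "Tom has 3 apples and buys 4 more. How many apples does he have?"
chain = translate(query)
print(solve(chain))  # -> 7
```

Because the solver is deterministic, changing any step of the chain necessarily changes the answer, which is what makes the explanation faithful by construction.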
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | MATH (test) | Overall Accuracy | 31.78 | 433 |
| Arithmetic Reasoning | GSM8K (test) | Accuracy | 75.8 | 129 |
| Arithmetic Reasoning | AQuA (test) | Accuracy | 61.8 | 58 |
| Mathematical Reasoning | gsm | Accuracy | 77.3 | 35 |
| Logical Reasoning | ProofWriter | Accuracy | 88.7 | 24 |
| Arithmetic Reasoning | AddSub (test) | Accuracy | 88.35 | 8 |
| Mathematical Reasoning | GSM-SYS | Accuracy | 56.1 | 7 |
| Mathematical Reasoning | ALGE | Accuracy | 64.9 | 7 |
| Logical Reasoning | CLUTRR (test) | Accuracy | 45.7 | 7 |
| Symbolic Reasoning | COLOR (Colored Object) | Accuracy | 90.6 | 7 |