RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Horizon Generation
About
We explore how iteratively revising a chain of thoughts with the help of information retrieval significantly improves large language models' reasoning and generation ability in long-horizon generation tasks, while substantially mitigating hallucination. In particular, the proposed method, *retrieval-augmented thoughts* (RAT), first generates an initial zero-shot chain of thought (CoT) and then revises each thought step one by one, using retrieved information relevant to the task query and to the current and past thought steps. Applying RAT to GPT-3.5, GPT-4, and CodeLLaMA-7b substantially improves their performance on various long-horizon generation tasks, with average relative increases in rating scores of 13.63% on code generation, 16.96% on mathematical reasoning, 19.2% on creative writing, and 42.78% on embodied task planning. The demo page can be found at https://craftjarvis.github.io/RAT
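The step-by-step revision described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `rat_revise`, `toy_retrieve`, `toy_revise`, and `TOY_CORPUS` are hypothetical, the retriever is a keyword lookup over an in-memory dictionary, and the reviser is a stub standing in for an LLM call. The draft steps are assumed to come from an initial zero-shot CoT generated elsewhere.

```python
from typing import Callable, Dict, List

# Hypothetical in-memory "knowledge base", keyed by a trigger keyword.
TOY_CORPUS: Dict[str, str] = {
    "sort": "Comparison sorts need O(n log n) comparisons in the worst case.",
    "search": "Binary search requires sorted input.",
}


def toy_retrieve(query: str) -> List[str]:
    """Stand-in for a real retriever: return entries whose keyword appears in the query."""
    lowered = query.lower()
    return [doc for key, doc in TOY_CORPUS.items() if key in lowered]


def toy_revise(task: str, past: List[str], step: str, docs: List[str]) -> str:
    """Stand-in for an LLM revision call: only records how many documents were consulted."""
    return f"{step} [checked against {len(docs)} docs]"


def rat_revise(
    task: str,
    draft_steps: List[str],
    retrieve: Callable[[str], List[str]],
    revise: Callable[[str, List[str], str, List[str]], str],
) -> List[str]:
    """Revise draft thought steps one by one, in order.

    The retrieval query for each step is built from the task, the
    already-revised past steps, and the current draft step.
    """
    revised: List[str] = []
    for step in draft_steps:
        query = "\n".join([task, *revised, step])
        docs = retrieve(query)
        revised.append(revise(task, revised, step, docs))
    return revised
```

Calling `rat_revise("find a target in a list", ["sort the list", "binary search for the target"], toy_retrieve, toy_revise)` revises the first step using one retrieved entry and the second using two, because the query for the second step also contains the already-revised first step. In a real system, `retrieve` and `revise` would wrap a search backend and an LLM prompt respectively.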
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Question Answering | 2WikiMQA | -- | 66 |
| Open-domain Question Answering | HotpotQA in-domain | F1 Score 53.8 | 57 |
| Open-domain Question Answering | MuSiQue (out-of-domain) | F1 29 | 57 |
| Open-domain Question Answering | 2WikiMultiHopQA in-domain | F1 Score 45.7 | 57 |
| Mathematical Reasoning | MATH | Math500 Score 74.4 | 41 |
| Reasoning | MMLU-Pro | History Score 57.5 | 40 |
| Medical Reasoning | Medicine MedQA M-Med | MedQA Score 74.4 | 40 |
| Open-domain QA | Bambogle v1 (out-of-domain) | F1 Score 53 | 33 |
| Mathematical Reasoning | Math Math500 Minerva | Math500 Score 77.5 | 28 |
| Open-domain Question Answering | Bamboogle (out-of-domain) | F1 60.3 | 24 |