
Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions

About

Prompting-based large language models (LLMs) are surprisingly powerful at generating natural language reasoning steps or Chains-of-Thoughts (CoT) for multi-step question answering (QA). They struggle, however, when the necessary knowledge is either unavailable to the LLM or not up-to-date within its parameters. While using the question to retrieve relevant text from an external knowledge source helps LLMs, we observe that this one-step retrieve-and-read approach is insufficient for multi-step QA. Here, what to retrieve depends on what has already been derived, which in turn may depend on what was previously retrieved. To address this, we propose IRCoT, a new approach for multi-step QA that interleaves retrieval with steps (sentences) in a CoT, guiding the retrieval with CoT and in turn using retrieved results to improve CoT. Using IRCoT with GPT3 substantially improves retrieval (up to 21 points) as well as downstream QA (up to 15 points) on four datasets: HotpotQA, 2WikiMultihopQA, MuSiQue, and IIRC. We observe similar substantial gains in out-of-distribution (OOD) settings as well as with much smaller models such as Flan-T5-large without additional training. IRCoT reduces model hallucination, resulting in factually more accurate CoT reasoning. Code, data, and prompts are available at https://github.com/stonybrooknlp/ircot
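The interleaving loop the abstract describes can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `retrieve` is a toy word-overlap ranker standing in for BM25, and `generate_next_cot_sentence` is a hypothetical stand-in for a prompted LLM call (GPT3 or Flan-T5 in the paper). The key idea shown is that each new retrieval query is the latest CoT sentence, not the original question.

```python
def retrieve(query, corpus, k=2):
    """Toy lexical retriever: rank passages by word overlap with the query.
    Stand-in for the BM25 retriever used in the paper."""
    q = set(query.lower().split())
    scored = sorted(corpus, key=lambda p: -len(q & set(p.lower().split())))
    return scored[:k]

def generate_next_cot_sentence(question, paragraphs, cot_so_far):
    """Hypothetical stand-in for the LLM: echoes the top retrieved passage.
    In IRCoT this would be a prompted CoT continuation conditioned on the
    question, the retrieved paragraphs, and the CoT generated so far."""
    return paragraphs[0] if paragraphs else "answer: unknown"

def ircot(question, corpus, max_steps=3):
    """Interleave retrieval with CoT generation until the model answers."""
    cot = []
    paragraphs = retrieve(question, corpus)  # first retrieval uses the question
    for _ in range(max_steps):
        sentence = generate_next_cot_sentence(question, paragraphs, cot)
        cot.append(sentence)
        if "answer:" in sentence.lower():
            break
        # Next retrieval is guided by the newest CoT sentence -- what to
        # retrieve now depends on what has already been derived.
        paragraphs = retrieve(sentence, corpus) + paragraphs
    return cot
```

With real retriever and generator components plugged in, the loop alternates between extending the reasoning chain and expanding the evidence pool, which is what drives the reported retrieval and QA gains.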

Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, Ashish Sabharwal • 2022

Related benchmarks

Task | Dataset | Metric | Result | Rank
Multi-hop Question Answering | 2WikiMultihopQA | EM | 50.7 | 278
Multi-hop Question Answering | HotpotQA | F1 | 56.2 | 221
Multi-hop Question Answering | HotpotQA (test) | F1 | 64.1 | 198
Question Answering | PopQA | -- | -- | 186
Multi-hop Question Answering | 2WikiMQA | F1 | 65.7 | 154
Multi-hop Question Answering | 2WikiMultiHopQA (test) | EM | 59.5 | 143
Question Answering | HotpotQA | F1 | 64.3 | 114
Multi-hop Question Answering | MuSiQue (test) | F1 | 31.8 | 111
Multi-hop Question Answering | MuSiQue | EM | 9.8 | 106
Multi-hop Question Answering | Bamboogle | EM | 24.5 | 97

(Showing 10 of 136 rows)

Other info

Code

https://github.com/stonybrooknlp/ircot