MuSiQue: Multihop Questions via Single-hop Question Composition

About

Multihop reasoning remains an elusive goal as existing multihop benchmarks are known to be largely solvable via shortcuts. Can we create a question answering (QA) dataset that, by construction, requires proper multihop reasoning? To this end, we introduce a bottom-up approach that systematically selects composable pairs of single-hop questions that are connected, i.e., where one reasoning step critically relies on information from another. This bottom-up methodology lets us explore a vast space of questions and add stringent filters as well as other mechanisms targeting connected reasoning. It provides fine-grained control over the construction process and the properties of the resulting k-hop questions. We use this methodology to create MuSiQue-Ans, a new multihop QA dataset with 25K 2-4 hop questions. Relative to existing datasets, MuSiQue-Ans is more difficult overall (3x increase in human-machine gap), and harder to cheat via disconnected reasoning (e.g., a single-hop model has a 30 point drop in F1). We further add unanswerable contrast questions to produce a more stringent dataset, MuSiQue-Full. We hope our datasets will help the NLP community develop models that perform genuine multihop reasoning.
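
As a rough illustration of what "composable pairs of single-hop questions that are connected" could mean, the toy sketch below chains two single-hop questions when the answer to the first appears as an entity in the second. This is not the authors' pipeline: the names SingleHop, is_composable, and compose are hypothetical, the substring check stands in for whatever entity matching and filtering the paper actually uses, and the rephrased description of the bridge entity is supplied by hand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SingleHop:
    """A single-hop question with its answer (hypothetical structure)."""
    question: str
    answer: str


def is_composable(first: SingleHop, second: SingleHop) -> bool:
    """Toy connectivity check (an assumption, not the paper's criteria):
    the first answer must be mentioned in the second question, and the
    two answers must differ so the second hop adds information."""
    return first.answer in second.question and first.answer != second.answer


def compose(first: SingleHop, second: SingleHop, description: str) -> SingleHop:
    """Build a 2-hop question by replacing the bridge entity in the second
    question with a description that can only be resolved by answering
    the first question."""
    if not is_composable(first, second):
        raise ValueError("questions are not connected through a bridge entity")
    question = second.question.replace(first.answer, description)
    return SingleHop(question=question, answer=second.answer)


# Illustrative data, not taken from the dataset.
q1 = SingleHop("Who wrote Hamlet?", "William Shakespeare")
q2 = SingleHop("Where was William Shakespeare born?", "Stratford-upon-Avon")
two_hop = compose(q1, q2, "the author of Hamlet")
print(two_hop.question)
```

In this sketch the composed question no longer names the bridge entity, so answering it requires resolving the first hop before the second. Reversing the pair fails the check, since the answer to q2 does not appear in q1.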

Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, Ashish Sabharwal • 2021

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Multi-hop Question Answering | MuSiQue | -- | 106 |
| Deep search | GAIA | Accuracy 15.6 | 37 |
| Question Answering | Musique (dev) | EM 41.5 | 11 |
| Multi-hop Question Answering | MuSiQue-Ans (test) | -- | 10 |
| Multi-hop Question Answering | SAGE-generated In-domain (test) | 3-Hop Accuracy 48.3 | 8 |
| Multi-hop Question Answering | FRAMES | Accuracy 25 | 8 |
| Question Answering | MUSIQUE Answerable (test) | Answer F1 52.3 | 7 |
| Retrieval | HotpotQA (dev) | EM 93.06 | 7 |
| Deep search | HLE | Accuracy 8 | 6 |
| Deep search | Browsecomp | Accuracy 2.1 | 6 |

Showing 10 of 13 rows
