
MuSiQue

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Multi-hop Question Answering | MuSiQue (test) | F1 50.9 | 111 |
| Multi-hop Question Answering | Musique | EM 40.5 | 106 |
| Question Answering | MuSiQue | EM 39.6 | 84 |
| Uncertainty Quantification | Musique 500 randomly sampled queries (test) | AUROC 0.8322 | 70 |
| Question Answering | MuSiQue | F1 Score 52.27 | 60 |
| Question Answering | MuSiQue (test) | F1 Score 59.8 | 43 |
| Multi-hop QA | MuSiQue | EM 52.8 | 42 |
| Multi-hop Reasoning | MuSiQue | EM 53 | 41 |
| Error Detection | MuSiQue (val) | Precision 1 | 36 |
| Error Detection | MuSiQue | F1 Score 0.93 | 36 |
| Question Answering | MuSiQue | Accuracy (ACC) 79.9 | 36 |
| Knowledge-intensive Reasoning | Musique | Accuracy 87 | 31 |
| Multi-hop QA Retrieval | MuSiQue (test) | R@2 59.6 | 28 |
| Multi-hop Reasoning | MuSiQue IRCoT 500 samples (test) | ACC 25.59 | 27 |
| Multi-Hop Question Answering | MuSiQue | Exact Match (EM) 12.5 | 27 |
| Multi-hop Question Answering | MuSiQue | Acc 43.6 | 26 |
| Multi-hop Question Answering | MuSiQue | ACCE 28.4 | 24 |
| Question Answering | MuSiQue entity-level knowledge conflict (test) | Mean Rank 7.7 | 24 |
| Multi-hop Question Answering | MuSiQue Full | C Score 80.1 | 22 |
| Multi-hop Question Answering | MusiQue answerable setting | Conciseness 38.98 | 21 |
| RAG Question Answering | Musique | F1 Score 20.5 | 20 |
| Multi-hop Question Answering | MusiQue | EM 17.6 | 20 |
| Question Answering | MuSiQue | LLM Accuracy 74.1 | 20 |
| End-to-end Question Answering | MuSiQue (test val) | EM 10.79 | 20 |
| Knowledge-Intensive Reasoning | MuSiQue | F1 Score 34.8 | 18 |
Showing 25 of 99 rows