MuSiQue

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Multi-hop Question Answering | Musique | EM 46 | 185 |
| Multi-hop Question Answering | MuSiQue (test) | F1 50.9 | 111 |
| Question Answering | MuSiQue | EM 39.6 | 84 |
| Uncertainty Quantification | Musique 500 randomly sampled queries (test) | AUROC 0.8322 | 70 |
| Question Answering | MuSiQue | F1 Score 52.27 | 70 |
| Multi-hop QA | MuSiQue | EM 77.2 | 65 |
| Multi-Hop Question Answering | MuSiQue | Exact Match (EM) 25.3 | 58 |
| Question Answering | Musique | EM 26 | 50 |
| Question Answering | MuSiQue (test) | F1 Score 59.8 | 43 |
| Multi-hop Reasoning | MuSiQue | EM 53 | 41 |
| Multi-hop Question Answering | MuSiQue | F1 46.1 | 38 |
| Multi-hop Question Answering | MuSiQue (test) | Token Cost 4,987 | 36 |
| Multi-hop QA Retrieval | MuSiQue | R@2 54.8 | 36 |
| Error Detection | MuSiQue (val) | Precision 1 | 36 |
| Error Detection | MuSiQue | F1 Score 0.93 | 36 |
| Question Answering | MuSiQue | Accuracy (ACC) 79.9 | 36 |
| Question Answering | MuSiQue | LLM Accuracy 74.1 | 34 |
| Multi-hop QA Retrieval | MuSiQue (test) | R@5 81.46 | 33 |
| Knowledge-intensive Reasoning | Musique | Accuracy 87 | 31 |
| Poisoning Attack | MuSiQue | Attack Success Rate (ASR) 87.9 | 30 |
| Long-context understanding | MuSiQue | SubEM 51 | 27 |
| Multi-hop Reasoning | MuSiQue | Accuracy 39.6 | 27 |
| Multi-hop Reasoning | MuSiQue IRCoT 500 samples (test) | ACC 25.59 | 27 |
| Question Answering | MuSiQue | F1 Score 81.3 | 25 |
| Multi-hop Question Answering | MuSiQue | Accuracy 47.55 | 24 |
Showing 25 of 155 rows