MuSR

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Reasoning | MuSR 0-shot | Reasoning Score (0-shot): 48.82 | 46 |
| Multistep Reasoning | MUSR | Accuracy: 61.67 | 31 |
| Math & Logic | MUSR | MUSR Performance: 42.12 | 24 |
| Reasoning | MuSR | Accuracy: 71.89 | 20 |
| Reasoning | MuSR (test) | Accuracy: 73.9 | 14 |
| Multistep Soft Reasoning | MUSR | Accuracy (%): 43.1 | 12 |
| Multi-hop Reasoning | MuSR | Accuracy: 43.12 | 10 |
| Multistep Soft Reasoning | MuSR | Accuracy: 69 | 9 |
| Reasoning | MuSR | MuSR Score: 37.14 | 9 |
| Self-doubt detection | MuSR 90-trace | AUROC (Self-doubt): 83.66 | 7 |
| Adding Mistake | MuSR | AOC: 0.731 | 7 |
| Truncated CoT Answering | MuSR | AOC: 33.6 | 7 |
| Multistep Reasoning | MUSR-fr | Average Score: 33.79 | 6 |
| Multistep Reasoning | MuSR | Accuracy: 41.5 | 3 |
| Multi-step reasoning and knowledge retrieval | MuSR (test) | Accuracy: 0.7867 | 1 |