
MuSR

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Multistep Reasoning | MuSR | Accuracy: 73.33 | 53 |
| Reasoning | MuSR 0-shot | Reasoning Score (0-shot): 48.82 | 46 |
| Multistep Soft Reasoning | MuSR | Accuracy: 69 | 41 |
| Multistep Reasoning | MUSR | Accuracy: 61.67 | 31 |
| Multistep Soft Reasoning | MUSR | Accuracy (Multi-choice): 50.77 | 27 |
| Math & Logic | MUSR | MUSR Performance: 42.12 | 24 |
| Multi-step Narrative Reasoning | MUSR | Accuracy: 65.86 | 22 |
| Reasoning | MuSR | Accuracy: 71.89 | 20 |
| Reasoning | MuSR (test) | Accuracy: 73.9 | 17 |
| Multi-hop Reasoning | MuSR | Accuracy: 43.12 | 10 |
| Reasoning | MuSR | MuSR Score: 37.14 | 9 |
| Self-doubt detection | MuSR 90-trace | AUROC (Self-doubt): 83.66 | 7 |
| Adding Mistake | MuSR | AOC: 0.731 | 7 |
| Truncated CoT Answering | MuSR | AOC: 33.6 | 7 |
| Multistep Reasoning | MUSR-fr | Average Score: 33.79 | 6 |
| Multi-step Reasoning | MuSR | Murder Mystery Score: 57.36 | 4 |
| Multi-step Reasoning | MUSR Teams | Accuracy: 61 | 3 |
| Multi-step Reasoning | MUSR Objects | Accuracy: 58 | 3 |
| Multi-step Reasoning | MUSR Murder | Accuracy: 68 | 3 |
| Multi-step reasoning and knowledge retrieval | MuSR (test) | Accuracy: 0.7867 | 1 |