SOTA Multi-step Reasoning benchmarks and papers with code | Wizwand
Multi-step Reasoning
Benchmarks
| Dataset Name | SOTA Method | Metric | Score | Results | Last Updated |
| --- | --- | --- | --- | --- | --- |
| ScienceQA | SceneAlign | Accuracy | 92.72 | 35 | 2mo ago |
| TriviaQA | LAR | Task Performance | 80.09 | 14 | 15d ago |
| StrategyQA (test) | Qwen3-4B + SFT + WeMask(TF) | Accuracy | 64.63 | 11 | 22d ago |
| StrategyQA | Qwen3-8B + SFT + WeMask(TF) | Accuracy | 66.99 | 10 | 22d ago |
| Bamboogle auto-eval (test) | Self-improvement, 2nd gen | Mean Accuracy | 76.1 | 10 | 3mo ago |
| GSM8K (test) | SLR | Pass@1 | 32.2 | 9 | 3mo ago |
| CLEVR-Puzzle (test) | NeSyCoCo | Accuracy | 95 | 7 | 3mo ago |
| SVAMP | eMoT | Accuracy | 94 | 5 | 1d ago |
| GSM Hard | eMoT | Accuracy | 71.5 | 5 | 1d ago |
| GAIA (test) | Skill-R1 (GRPO, Qwen3-4b) | Level 1 Accuracy | 31 | 4 | 22d ago |
| MuSR | n=64 (RL) | Murder Mystery Score | 57.36 | 4 | 22d ago |
| Musique 16k ~ 20k | Absorber LLM | Macro Accuracy | 29.5 | 4 | 1mo ago |
| Musique 12k ~ 16k | Absorber LLM | Accuracy (macro) | 31.6 | 4 | 1mo ago |
| Musique 8k ~ 12k | TTT | Macro Accuracy | 34.3 | 4 | 1mo ago |
| BamTwoogle (test) | ReST meets ReAct | Accuracy | 74 | 4 | 3mo ago |
| Bamboogle (test) | ReST meets ReAct | Accuracy | 74.4 | 4 | 3mo ago |
| Checkmate | eMoT | Accuracy | 90 | 3 | 1d ago |
| WordSort | BoT | Accuracy | 100 | 3 | 1d ago |
| Game24 | eMoT | Accuracy | 100 | 3 | 1d ago |
| MUSR Teams | Static Workflow | Accuracy | 61 | 3 | 1d ago |
| MUSR Objects | Static Workflow | Accuracy | 58 | 3 | 1d ago |
| MUSR Murder | Static Workflow | Accuracy | 68 | 3 | 1d ago |
| multi-step reasoning tasks | EmbodiedAct | Average Score | 61.6 | 3 | 3mo ago |
Showing 23 of 23 rows