Situation Reasoning on STBench 200 held-out questions
Chart (not reproduced): series Accuracy and Delta (Gap). Top result: MAGE, 57.8 accuracy, May 11, 2026. Updated 22d ago.
Evaluation Results
Method       Details                      Date      Accuracy   Delta (Gap)
MAGE         Backbone=Qwen3-8B, Jud...    2026.05   57.8       20.8
MAGE         Backbone=Qwen3-8B, Jud...    2026.05   54.8       20.8
Reflexion    Backbone=Qwen3-8B, Jud...    2026.05   37         -
0-shot CoT   Backbone=Qwen3-8B, Jud...    2026.05   35.5       -
ReAct        Backbone=Qwen3-8B, Jud...    2026.05   31.8       -
SC10         Backbone=Qwen3-8B, Jud...    2026.05   20         -
8-shot       Backbone=Qwen3-8B, Jud...    2026.05   12         -
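The page does not define the Delta (Gap) column. One reading that matches the top row is the margin of MAGE over the strongest other method (57.8 vs. 37 for Reflexion). The sketch below checks that arithmetic. The function name `gap_over_best_baseline` and this definition of the gap are assumptions for illustration, not something the leaderboard states, and only the top MAGE row is used.

```python
# Accuracy values (percent) copied from the results listed above.
# Only the top MAGE row (57.8) is included.
accuracy = {
    "MAGE": 57.8,
    "Reflexion": 37.0,
    "0-shot CoT": 35.5,
    "ReAct": 31.8,
    "SC10": 20.0,
    "8-shot": 12.0,
}


def gap_over_best_baseline(scores, method):
    """Hypothetical helper: accuracy of `method` minus the best accuracy
    among all other methods, rounded to one decimal place.

    This is an assumed definition of "Delta (Gap)", not one given by the page.
    """
    best_baseline = max(v for name, v in scores.items() if name != method)
    return round(scores[method] - best_baseline, 1)


gap = gap_over_best_baseline(accuracy, "MAGE")
print(gap)
```

Under this assumed definition the result agrees with the 20.8 listed for the 57.8 MAGE row. The second MAGE row (54.8) is also listed with 20.8, which this definition would not reproduce, so the column may be computed differently or reported once per method.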