Failure Detection on RoboFail (Out-Of-Domain)
[Chart: Execution Binary Acc and Planning Binary Acc. Highlighted result: Guardian-8B-Thinking, 86 Execution Binary Acc, Dec 1, 2025]
Updated 4d ago
Evaluation Results
Method                Model Category  Date     Execution Binary Acc  Planning Binary Acc
Guardian-8B-Thinking  Propose...      2025.12  86                    70
Qwen3-VL-235B-A22B    Large-s...      2025.12  82                    70
GPT4.1                Large-s...      2025.12  82                    67
Sentinel              Special...      2025.12  79                    -
InternVL3-8B          Special...      2025.12  77                    53
Cosmos-Reason1-7B     Special...      2025.12  67                    70
AHA-13B               Special...      2025.12  64                    -
CLIP+MLP              Special...      2025.12  42                    43