Failure Detection and Reasoning on Maniskill
[Chart: Detection Accuracy. Top result: SFT-D, 78.8 (Feb 12, 2026). Metrics available: Detection Accuracy, LLM Fuzzy Score, ROUGEL Score.]
Evaluation Results
Method                  Date      Detection Accuracy   LLM Fuzzy Score   ROUGEL Score
SFT-D                   2026.02   78.8                 0.644             74.3
Claude-3.7 (shots=3)    2026.02   62.5                 0.398             21.2
Qwen2.5-VL (shots=0)    2026.02   54.8                 0.268             11.2
Claude-3.7 (shots=0)    2026.02   53.8                 0.36              13.3
Cosmos-Reasoning        2026.02   44.2                 0.359             20.7
LLaVA-NeXT              2026.02   1                    0.286             4.2