Failure Detection and Reasoning on ARMBench
[Chart: Detect Acc. by date. Top result: Claude-3.7, 65 (Feb 12, 2026). Available metrics: Detect Acc., LLM Fuzzy Score, ROUGE-L.]
Updated 4d ago
Evaluation Results
Method            Setting  Date     Detect Acc.  LLM Fuzzy Score  ROUGE-L
Claude-3.7        shots=3  2026.02  65           68.5             72.5
SFT-D             -        2026.02  64           60.9             61.8
Claude-3.7        shots=0  2026.02  59           49.4             38.7
Qwen2.5-VL        shots=0  2026.02  50           34.9             20.8
LLaVA-NeXT        -        2026.02  50           0                6.7
Cosmos-Reasoning  -        2026.02  48           50.3             48