Task Progression Failure Detection on Close Box Combined
[Chart: TPR, TNR, and Accuracy by method. Highlighted point: GPT-4o Image QA, TPR 100, Mar 12, 2026.]
Updated 1mo ago
Evaluation Results
| Method | Details | Links | TPR | TNR | Accuracy |
|---|---|---|---|---|---|
| GPT-4o Image QA | Failure Detector Categ... | 2026.03 | 100 | 0 | 60 |
| Gemini 1.5 Pro Video QA | Failure Detector Categ... | 2026.03 | 98 | 57 | 82 |
| Gemini 1.5 Pro Image QA | Failure Detector Categ... | 2026.03 | 96 | 0 | 57 |
| Sentinel | Components=STAC MMD* +... | 2026.03 | 96 | 86 | 92 |
| GPT-4o Video QA* | Failure Detector Categ... | 2026.03 | 88 | 89 | 89 |
| Claude 3.5 Sonnet Image QA | Failure Detector Categ... | 2026.03 | 81 | 6 | 51 |
| Claude 3.5 Sonnet Video QA | Failure Detector Categ... | 2026.03 | 81 | 31 | 61 |
| Temporal Non-Distr. Min. | Failure Detector Categ... | 2026.03 | 71 | 97 | 82 |
| STAC For. KL | Failure Detector Categ... | 2026.03 | 65 | 97 | 78 |
| STAC Rev. KL | Failure Detector Categ... | 2026.03 | 65 | 97 | 78 |
| STAC MMD* | Failure Detector Categ... | 2026.03 | 65 | 97 | 78 |
| Diffusion Output Variance | Failure Detector Categ... | 2026.03 | 25 | 100 | 55 |
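For readers unfamiliar with the three columns, the sketch below computes TPR, TNR, and Accuracy from per-episode labels using their standard definitions. It assumes the positive class is "failure", which this page does not state. The function name `detection_metrics` and the 6/4 split of failures to successes are hypothetical, chosen only for illustration.

```python
import math


def detection_metrics(y_true, y_pred):
    """Return TPR, TNR, and Accuracy as percentages.

    y_true, y_pred: equal-length sequences of bools.
    True means "failure" (treated as the positive class; an assumption).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(t and p for t, p in pairs)              # failures flagged as failures
    tn = sum((not t) and (not p) for t, p in pairs)  # successes left unflagged
    fp = sum((not t) and p for t, p in pairs)        # successes wrongly flagged
    fn = sum(t and (not p) for t, p in pairs)        # failures missed

    def pct(num, den):
        # Undefined when a class is absent; report NaN instead of raising.
        return 100.0 * num / den if den else math.nan

    return {
        "TPR": pct(tp, tp + fn),
        "TNR": pct(tn, tn + fp),
        "Accuracy": pct(tp + tn, len(pairs)),
    }


# Hypothetical data: 6 failure episodes, 4 successful ones,
# scored by a detector that flags every episode as a failure.
labels = [True] * 6 + [False] * 4
always_flag = [True] * 10
print(detection_metrics(labels, always_flag))
```

A detector that always predicts "failure" gets a perfect TPR and a TNR of zero, so its accuracy equals the share of failure episodes in the data. High TPR alone therefore says little without the TNR column next to it.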