Failure Diagnosis on SuperGLUE
[Chart: Macro Similarity Score over time; PROBELLM at 36 on Feb 13, 2026. Selectable metrics: Macro Similarity Score, Micro Similarity Score, Error Rate, Average Error Rate, Number of Clusters.]
Evaluation Results
| Method | Configuration | Links | Macro Similarity Score | Micro Similarity Score | Error Rate | Average Error Rate | Number of Clusters |
|---|---|---|---|---|---|---|---|
| PROBELLM | Target Model=GPT-5.2,... | 2026.02 | 36 | 64 | 21 | 29 | 5 |
| PROBELLM | Target Model=Mistral-7... | 2026.02 | 27 | 73 | 65 | 78 | 20 |
| PROBELLM | Target Model=GPT-4.1,... | 2026.02 | 22 | 78 | 42 | 36 | 6 |
| PROBELLM | Target Model=Ministral... | 2026.02 | 21 | 79 | 60 | 57 | 16 |
| PROBELLM | Target Model=Mistral-7... | 2026.02 | 19 | 81 | 72 | 81 | 16 |
| PROBELLM | Target Model=GPT-4o-mi... | 2026.02 | 18 | 83 | 57 | 63 | 10 |
| PROBELLM | Target Model=GPT-3.5-t... | 2026.02 | 16 | 84 | 78 | 75 | 16 |
| PROBELLM | Target Model=Ministral... | 2026.02 | 14 | 86 | 67 | 74 | 19 |