Failure Diagnosis on TruthfulQA
[Chart: Macro Similarity Type by date. Plotted entry: PROBELLM, 38, Feb 13, 2026.]
Updated 4d ago
Evaluation Results
| Method | Date | Macro Similarity Type | Micro Similarity Type | Error Rate | Average Error Rate | Number of Clusters |
|---|---|---|---|---|---|---|
| PROBELLM (Target Model=Mistral-7...) | 2026.02 | 38 | 62 | 77 | 78 | 20 |
| PROBELLM (Target Model=GPT-4o-mi...) | 2026.02 | 29 | 71 | 57 | 63 | 10 |
| PROBELLM (Target Model=Ministral...) | 2026.02 | 24 | 76 | 72 | 74 | 19 |
| PROBELLM (Target Model=GPT-4.1,...) | 2026.02 | 23 | 77 | 47 | 36 | 6 |
| PROBELLM (Target Model=GPT-5.2,...) | 2026.02 | 23 | 77 | 41 | 29 | 5 |
| PROBELLM (Target Model=GPT-3.5-t...) | 2026.02 | 22 | 78 | 67 | 75 | 16 |
| PROBELLM (Target Model=Ministral...) | 2026.02 | 22 | 78 | 60 | 57 | 16 |
| PROBELLM (Target Model=Mistral-7...) | 2026.02 | 21 | 79 | 82 | 81 | 16 |