Hallucination Detection (Math Word Problems) on UMWP
Top result: HalluClean, F1 Score 89.1 (Nov 12, 2025)
Evaluation Results
| Method | Setting | Date | F1 Score | Accuracy |
|---|---|---|---|---|
| HalluClean | Backbone=DeepSeek-V3 | 2025.11 | 89.1 | 89.5 |
| HalluClean | Backbone=Llama-3-70B | 2025.11 | 85.6 | 86 |
| Llama-3-70B | Strategy=Direct Ask | 2025.11 | 83.4 | 83.5 |
| HalluClean | Backbone=GPT-3.5-turbo | 2025.11 | 80.3 | 81.5 |
| ChatProtect | Backbone=GPT-3.5-turbo | 2025.11 | 74 | 71.8 |
| Plan-and-Solve | Backbone=GPT-3.5-turbo | 2025.11 | 66.9 | 73.8 |
| DeepSeek-R1 | Strategy=Direct Ask | 2025.11 | 65.2 | 72 |
| GPT-4o-mini | Strategy=Direct Ask | 2025.11 | 61.7 | 70.5 |
| Step-by-Step | Backbone=GPT-3.5-turbo | 2025.11 | 55.7 | 65 |
| DeepSeek-V3 | Strategy=Direct Ask | 2025.11 | 55.6 | 68 |
| GPT-3.5-turbo | Strategy=Direct Ask | 2025.11 | 50.9 | 59.5 |
| SelfCheckGPT | Backbone=GPT-3.5-turbo | 2025.11 | 25.8 | 54 |