Faithfulness detection on Step-level Benchmark In-domain Math
Best FF1: 84.2 (GeoFaith)
[Chart: FF1 and UF1, May 26, 2026]
Evaluation Results
Method                    Date      FF1    UF1
GeoFaith                  2026.05   84.2   73.1
GPT-o1                    2026.05   83.5   72.8
GPT-4o                    2026.05   80.1   67.6
DeepSeek-V3               2026.05   79.8   69.1
Qwen2.5-32B-Instruct      2026.05   78.0   62.5
o3-mini                   2026.05   76.7   61.4
Llama-3.1-70B-Instruct    2026.05   73.5   52.8
LogicReward               2026.05   72.2   59.8
FaithLens                 2026.05   61.2   49.7
HHEM2.1                   2026.05   53.7   43.2