Faithfulness detection on In-domain Step-level Benchmark Reasoning
[Chart: FF1 and UF1 scores. Top result: GeoFaith, 84.5 FF1, May 26, 2026]
Updated 7d ago
Evaluation Results
| Method | Date | FF1 | UF1 |
|---|---|---|---|
| GeoFaith | 2026.05 | 84.5 | 68.8 |
| GPT-o1 | 2026.05 | 83.8 | 72.5 |
| DeepSeek-V3 | 2026.05 | 82.6 | 67.2 |
| GPT-4o | 2026.05 | 82.4 | 62.8 |
| Qwen2.5-32B-Instruct | 2026.05 | 80.1 | 62.3 |
| LogicReward | 2026.05 | 80.0 | 67.1 |
| o3-mini | 2026.05 | 78.9 | 59.8 |
| Llama-3.1-70B-Instruct | 2026.05 | 72.3 | 59.7 |
| FaithLens | 2026.05 | 71.2 | 52.3 |
| HHEM2.1 | 2026.05 | 59.2 | 39.7 |