Faithfulness detection on In-domain Step-level Benchmark Agent
[Figure: FF1 / UF1 chart. Top FF1: 80.2 (GeoFaith). May 26, 2026.]
Evaluation Results
| Method                 | Date    | FF1  | UF1  |
|------------------------|---------|------|------|
| GeoFaith               | 2026.05 | 80.2 | 69.8 |
| GPT-o1                 | 2026.05 | 77.1 | 66.8 |
| GPT-4o                 | 2026.05 | 76.2 | 56   |
| DeepSeek-V3            | 2026.05 | 72.3 | 57.9 |
| o3-mini                | 2026.05 | 72.1 | 53.5 |
| Llama-3.1-70B-Instruct | 2026.05 | 71.8 | 49.7 |
| Qwen2.5-32B-Instruct   | 2026.05 | 70.9 | 62.3 |
| LogicReward            | 2026.05 | 69.7 | 51   |
| FaithLens              | 2026.05 | 53.3 | 36.5 |
| HHEM2.1                | 2026.05 | 49.7 | 32.3 |