Faithfulness detection on In-domain Step-level Benchmark Reasoning

84.5 F1

GeoFaith

Updated 1mo ago

Evaluation Results

Method	Links	F1	(metric not labeled)
GeoFaith 2026.05		84.5	68.8
GPT-o1 2026.05		83.8	72.5
DeepSeek-V3 2026.05		82.6	67.2
GPT-4o 2026.05		82.4	62.8
Qwen2.5-32B-Instruct 2026.05		80.1	62.3
LogicReward 2026.05		80.0	67.1
o3-mini 2026.05		78.9	59.8
Llama-3.1-70B-Instruct 2026.05		72.3	59.7
FaithLens 2026.05		71.2	52.3
HHEM2.1 2026.05		59.2	39.7