
FactCG: Enhancing Fact Checkers with Graph-Based Multi-Hop Data

About

Prior research on training grounded factuality classification models to detect hallucinations in large language models (LLMs) has relied on public natural language inference (NLI) data and synthetic data. However, conventional NLI datasets are not well-suited for document-level reasoning, which is critical for detecting LLM hallucinations. Recent approaches to document-level synthetic data generation involve iteratively removing sentences from documents and annotating factuality using LLM-based prompts. While effective, this method is computationally expensive for long documents and limited by the LLM's capabilities. In this work, we analyze the differences between the synthetic training data used in state-of-the-art models and real LLM output claims. Based on our findings, we propose a novel approach for synthetic data generation, CG2C, that leverages multi-hop reasoning on context graphs extracted from documents. Our fact checker model, FactCG, demonstrates improved performance with more connected reasoning, using the same backbone models. Experiments show it even outperforms GPT-4o on the LLM-AggreFact benchmark with a much smaller model size.
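The abstract's core idea, as described, is to extract a context graph from a document and compose claims by traversing multiple edges, so that verifying a claim requires connected reasoning across sentences. The paper does not spell out the algorithm here, so the following is only an illustrative sketch of that idea; the function names, the triple-based graph representation, and the deterministic edge choice are all assumptions, not the authors' CG2C implementation.

```python
from collections import defaultdict

# Illustrative sketch (not the paper's actual CG2C pipeline): build a
# context graph from relation triples extracted from a document, then
# walk several hops to form a claim that spans multiple sentences.

def build_context_graph(triples):
    """triples: (head_entity, relation, tail_entity) tuples from a document."""
    graph = defaultdict(list)
    for head, rel, tail in triples:
        graph[head].append((rel, tail))
    return graph

def multi_hop_claim(graph, start, hops):
    """Follow up to `hops` edges from `start`, composing a multi-hop claim."""
    parts, node = [], start
    for _ in range(hops):
        if not graph[node]:
            break
        rel, nxt = graph[node][0]  # deterministic pick keeps the sketch simple
        parts.append(f"{node} {rel} {nxt}")
        node = nxt
    return "; ".join(parts)

# Two document sentences yield two edges; a 2-hop walk joins them into
# one claim whose verification needs both sentences.
triples = [
    ("Marie Curie", "was born in", "Warsaw"),
    ("Warsaw", "is the capital of", "Poland"),
]
graph = build_context_graph(triples)
print(multi_hop_claim(graph, "Marie Curie", 2))
# → Marie Curie was born in Warsaw; Warsaw is the capital of Poland
```

A claim produced this way cannot be checked against any single sentence, which is the property the abstract argues conventional NLI training data lacks.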

Deren Lei, Yaxi Li, Siyao Li, Mengya Hu, Rui Xu, Ken Archer, Mingyu Wang, Emily Ching, Alex Deng • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Veracity Assessment | FactCheck-Bench | Macro-F1 | 89 | 26 |
| Fact Checking | ExpertQA | -- | -- | 15 |
| Faithfulness Hallucination Detection | LLM-AggreFact Refined | Agg-CNN | 76.9 | 14 |
| Binary Fact-checking | MediaSum | Macro-F1 | 79.1 | 14 |
| Binary Fact-checking | Reveal | Macro-F1 | 90 | 14 |
| Binary Fact-checking | Claim Verify | Macro-F1 | 0.762 | 14 |
| Faithfulness Hallucination Detection | LLM-AggreFact & HoVer Refined | Overall Std Dev | 7 | 14 |
| Binary Fact-checking | MeetingBank | Macro-F1 | 71.9 | 14 |
| Hallucination Detection | All detection benchmark sets average (latest) | Precision | 66.07 | 14 |
| Hallucination Detection | HaluBench (test) | HE | 66.29 | 14 |
