
MiniCheck: Efficient Fact-Checking of LLMs on Grounding Documents

About

Recognizing if LLM output can be grounded in evidence is central to many tasks in NLP: retrieval-augmented generation, summarization, document-grounded dialogue, and more. Current approaches to this kind of fact-checking are based on verifying each piece of a model generation against potential evidence using an LLM. However, this process can be very computationally expensive, requiring many calls to a model to check a single response. In this work, we show how to build small fact-checking models that have GPT-4-level performance but for 400x lower cost. We do this by constructing synthetic training data with GPT-4, which involves creating realistic yet challenging instances of factual errors via a structured generation procedure. Training on this data teaches models to check each fact in the claim and recognize synthesis of information across sentences. For evaluation, we unify datasets from recent work on fact-checking and grounding LLM generations into a new benchmark, LLM-AggreFact. Our best system MiniCheck-FT5 (770M parameters) outperforms all systems of comparable size and reaches GPT-4 accuracy. We release LLM-AggreFact, code for data synthesis, and models.
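The checking scheme described above (verify each piece of a model generation against evidence, then aggregate) can be sketched as follows. This is an illustrative sketch only: the sentence splitter and the word-overlap scorer are hypothetical stand-ins, and a real system such as MiniCheck-FT5 would replace `check_sentence` with a trained entailment model. A response counts as grounded only if every claim sentence is supported.

```python
def split_into_claims(response: str) -> list[str]:
    """Naive sentence splitter; real systems use an LLM or an NLP toolkit."""
    return [s.strip() for s in response.split(".") if s.strip()]

def check_sentence(claim: str, document: str) -> float:
    """Stand-in scorer: fraction of claim words that appear in the document.
    A real checker (e.g. MiniCheck) would run a fine-tuned model here."""
    claim_words = set(claim.lower().split())
    doc_words = set(document.lower().split())
    return len(claim_words & doc_words) / max(len(claim_words), 1)

def is_grounded(response: str, document: str, threshold: float = 0.8) -> bool:
    """The response is grounded only if every claim sentence is supported."""
    claims = split_into_claims(response)
    return all(check_sentence(c, document) >= threshold for c in claims)

doc = "MiniCheck-FT5 has 770M parameters and reaches GPT-4 accuracy."
print(is_grounded("MiniCheck-FT5 has 770M parameters.", doc))  # True
print(is_grounded("MiniCheck-FT5 has 11B parameters.", doc))   # False
```

The all-sentences-supported aggregation mirrors how grounding benchmarks typically assign a document-level label: a single unsupported fact makes the whole response unsupported.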

Liyan Tang, Philippe Laban, Greg Durrett • 2024

Related benchmarks

Task                       Dataset          Metric             Result  Rank
Veracity Assessment        FactCheck-Bench  Macro-F1           86.8    26
Fact Checking              PubHealth        Balanced Accuracy  66.3    26
Fact Checking              COVID-Fact       Balanced Accuracy  65.9    22
General QA Verification    NQ               P@1                71.92   16
General QA Verification    TriviaQA         P@1                0.7446  16
Multi-Hop QA Verification  HotpotQA         P@1                62.72   16
General QA Verification    PopQA            P@1                67.4    16
Multi-Hop QA Verification  2Wiki            P@1                58.05   16
Multi-Hop QA Verification  MuSiQue          P@1                53.25   16
Fact Checking              SummEval         Balanced Accuracy  74.8    15

Showing 10 of 27 rows
