
MiniCheck: Efficient Fact-Checking of LLMs on Grounding Documents

About

Recognizing if LLM output can be grounded in evidence is central to many tasks in NLP: retrieval-augmented generation, summarization, document-grounded dialogue, and more. Current approaches to this kind of fact-checking are based on verifying each piece of a model generation against potential evidence using an LLM. However, this process can be very computationally expensive, requiring many calls to a model to check a single response. In this work, we show how to build small fact-checking models that have GPT-4-level performance but for 400x lower cost. We do this by constructing synthetic training data with GPT-4, which involves creating realistic yet challenging instances of factual errors via a structured generation procedure. Training on this data teaches models to check each fact in the claim and recognize synthesis of information across sentences. For evaluation, we unify datasets from recent work on fact-checking and grounding LLM generations into a new benchmark, LLM-AggreFact. Our best system MiniCheck-FT5 (770M parameters) outperforms all systems of comparable size and reaches GPT-4 accuracy. We release LLM-AggreFact, code for data synthesis, and models.
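The checking scheme described above (verify each piece of a model generation against evidence, then aggregate) can be sketched as follows. This is an illustrative sketch only: the sentence splitter and the word-overlap scorer are hypothetical stand-ins, and a real system such as MiniCheck-FT5 would replace `check_sentence` with a trained entailment model. A response counts as grounded only if every claim sentence is supported.

```python
def split_into_claims(response: str) -> list[str]:
    """Naive sentence splitter; real systems use an LLM or an NLP toolkit."""
    return [s.strip() for s in response.split(".") if s.strip()]

def check_sentence(claim: str, document: str) -> float:
    """Stand-in scorer: fraction of claim words that appear in the document.
    A real checker (e.g. MiniCheck) would run a fine-tuned model here."""
    claim_words = set(claim.lower().split())
    doc_words = set(document.lower().split())
    return len(claim_words & doc_words) / max(len(claim_words), 1)

def is_grounded(response: str, document: str, threshold: float = 0.8) -> bool:
    """The response is grounded only if every claim sentence is supported."""
    claims = split_into_claims(response)
    return all(check_sentence(c, document) >= threshold for c in claims)

doc = "MiniCheck-FT5 has 770M parameters and reaches GPT-4 accuracy."
print(is_grounded("MiniCheck-FT5 has 770M parameters.", doc))  # True
print(is_grounded("MiniCheck-FT5 has 11B parameters.", doc))   # False
```

The all-sentences-supported aggregation mirrors how grounding benchmarks typically assign a document-level label: a single unsupported fact makes the whole response unsupported.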

Liyan Tang, Philippe Laban, Greg Durrett • 2024

Related benchmarks

Task                       Dataset          Metric             Result  Rank
Veracity Assessment        FactCheck-Bench  Macro-F1           86.8    26
Fact Checking              PubHealth        Balanced Accuracy  66.3    26
Fact Checking              COVID-Fact       Balanced Accuracy  65.9    22
General QA Verification    NQ               P@1                71.92   16
General QA Verification    TriviaQA         P@1                0.7446  16
Multi-Hop QA Verification  HotpotQA         P@1                62.72   16
General QA Verification    PopQA            P@1                67.4    16
Multi-Hop QA Verification  2Wiki            P@1                58.05   16
Multi-Hop QA Verification  MuSiQue          P@1                53.25   16
Fact Checking              SummEval         Balanced Accuracy  74.8    15

Showing 10 of 27 rows
