HalluciNot: Hallucination Detection Through Context and Common Knowledge Verification

About

This paper introduces a comprehensive system for detecting hallucinations in large language model (LLM) outputs in enterprise settings. We present a novel taxonomy of LLM responses specific to hallucination in enterprise applications, categorizing them into context-based, common knowledge, enterprise-specific, and innocuous statements. Our hallucination detection model, HDM-2, validates LLM responses with respect to both context and generally known facts (common knowledge). It provides both hallucination scores and word-level annotations, enabling precise identification of problematic content. To evaluate it on context-based and common-knowledge hallucinations, we introduce a new dataset, HDMBench. Experimental results demonstrate that HDM-2 outperforms existing approaches across the RAGTruth, TruthfulQA, and HDMBench datasets. This work addresses the specific challenges of enterprise deployment, including computational efficiency, domain specialization, and fine-grained error identification. Our evaluation dataset, model weights, and inference code are publicly available.

Bibek Paudel, Alexander Lyzhov, Preetam Joshi, Puneet Anand • 2025

Related benchmarks

Task	Dataset	Result
Hallucination Detection	HaluEvalQA	ROC-AUC 0.8385	39
Response-level Hallucination Detection	RAGTruth QA	AUROC 87.95	13
Response-level Hallucination Detection	RAGognize	AUROC 75.41	13
Response-level Hallucination Detection	HDM-Bench	AUROC 69.62	11
Hallucination Detection	HDMBench (test)	HF1 73.6	10
Token-level hallucination detection	RAGTruth QA	AUROC 90.61	7
Token-level hallucination detection	RAGognize	AUROC 68.72	7
Token-level hallucination detection	HDM-Bench	AUROC 74.99	5
Hallucination Detection	FEVER	Accuracy 33.48	3
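
Most rows above report AUROC (ROC-AUC), which can be read as the probability that a randomly chosen hallucinated example receives a higher detector score than a randomly chosen non-hallucinated one. The following is a minimal sketch of that computation, not the evaluation code behind these results; the function name `auroc` and the toy scores and labels are assumptions made purely for illustration.

```python
def auroc(scores, labels):
    """Pairwise AUROC: fraction of (positive, negative) pairs ranked correctly.

    scores: detector outputs, higher means "more likely hallucinated".
    labels: 1 for hallucinated, 0 for not hallucinated.
    Ties between a positive and a negative count as half a correct pair.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")

    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Toy data (hypothetical, not taken from any benchmark).
toy_scores = [0.9, 0.8, 0.3, 0.6, 0.2]
toy_labels = [1, 1, 0, 0, 1]
toy_result = auroc(toy_scores, toy_labels)
```

The value lies in [0, 1], with 0.5 corresponding to random ranking. The table mixes a 0 to 1 scale (0.8385) with a percentage scale (for example 87.95), so multiply by 100 to compare against the percentage rows.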

