
Exposing the Ghost in the Transformer: Abnormal Detection for Large Language Models via Hidden State Forensics

About

The widespread adoption of Large Language Models (LLMs) in critical applications has introduced severe reliability and security risks, as LLMs remain vulnerable to notorious threats such as hallucinations, jailbreak attacks, and backdoor exploits. These vulnerabilities have been weaponized by malicious actors, leading to unauthorized access, widespread misinformation, and compromised LLM-embedded system integrity. In this work, we introduce a novel approach to detecting abnormal behaviors in LLMs via hidden state forensics. By systematically inspecting layer-specific activation patterns, we develop a general framework that can efficiently identify a range of security threats in real-time without imposing prohibitive computational costs. Extensive experiments indicate detection accuracies exceeding 95% and consistently robust performance across multiple models in most scenarios, while preserving the ability to detect novel attacks effectively. Furthermore, the computational overhead remains minimal, with detector inference taking merely fractions of a second. The significance of this work lies in proposing a promising strategy to reinforce the security of LLM-integrated systems, paving the way for safer and more reliable deployment in high-stakes domains. By enabling real-time detection that can also support the mitigation of abnormal behaviors, it represents a meaningful step toward ensuring the trustworthiness of AI systems amid rising security challenges.
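To make the idea of hidden state forensics concrete, here is a minimal, hedged sketch of a layer-activation detector. It is not the paper's implementation: real hidden states would be extracted from a transformer (e.g. via a forward pass that returns per-layer activations), whereas this toy uses synthetic vectors, a hypothetical dimensionality, and a simple nearest-centroid classifier standing in for the paper's detector.

```python
# Illustrative sketch only: synthetic "hidden states" plus a lightweight
# nearest-centroid detector over per-layer activation statistics.
import numpy as np

rng = np.random.default_rng(0)
DIM = 16  # hypothetical hidden-state dimensionality (assumption)

def layer_features(hidden_states):
    """Collapse per-layer hidden states [n_layers, dim] into one feature
    vector (per-layer mean and std) -- a simple stand-in for inspecting
    layer-specific activation patterns."""
    hs = np.asarray(hidden_states)
    return np.concatenate([hs.mean(axis=1), hs.std(axis=1)])

# Synthetic training data: "normal" activations centered at 0,
# "abnormal" ones shifted -- purely illustrative, not real model data.
normal = [rng.normal(0.0, 1.0, size=(4, DIM)) for _ in range(50)]
abnormal = [rng.normal(1.5, 1.0, size=(4, DIM)) for _ in range(50)]
X = np.stack([layer_features(h) for h in normal + abnormal])
y = np.array([0] * 50 + [1] * 50)

# One centroid per class; inference is a couple of vector ops, which is
# why this style of detector can run in real time with minimal overhead.
centroids = np.stack([X[y == c].mean(axis=0) for c in (0, 1)])

def detect(hidden_states):
    """Return 1 if the activations look abnormal, else 0."""
    f = layer_features(hidden_states)
    return int(np.argmin(np.linalg.norm(centroids - f, axis=1)))

print(detect(rng.normal(1.5, 1.0, size=(4, DIM))))  # shifted sample
```

In practice the features, classifier, and training data would come from the method and datasets described in the paper; the sketch only shows the overall shape of the pipeline (extract per-layer statistics, then run a cheap classifier over them).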

Shide Zhou, Kailong Wang, Ling Shi, Haoyu Wang • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Backdoor Attack Detection | BadNet | Variance: 0.00e+0 | 19 |
| Backdoor Attack Detection | VPI | Variance: 0.00e+0 | 19 |
| Abnormal Behavior Detection | Alpaca-GPT4 (test) | Accuracy: 100 | 17 |
| Abnormal Behavior Detection | JailBreakV (test) | Accuracy: 100 | 17 |
| Abnormal Behavior Detection | GCG (test) | Accuracy: 100 | 17 |
| Abnormal Behavior Detection | COLD-Attack (test) | Accuracy: 1 | 17 |
| Abnormal Behavior Detection | LAA (test) | Accuracy: 100 | 17 |
| Hallucination Detection | Truthful QA | Accuracy: 74.17 | 17 |
| Hallucination Detection | HaluEval QA | Accuracy: 99.5 | 17 |
| Hallucination Detection | Drowzee-Dataset | Accuracy: 100 | 17 |
(10 of 11 benchmark rows shown)
