Layerwise Convergence Fingerprints for Runtime Misbehavior Detection in Large Language Models

About

Large language models deployed at runtime can misbehave in ways that clean-data validation cannot anticipate: training-time backdoors lie dormant until triggered, jailbreaks subvert safety alignment, and prompt injections override the deployer's instructions. Existing runtime defenses address these threats one at a time and often assume a clean reference model, trigger knowledge, or editable weights, assumptions that rarely hold for opaque third-party artifacts. We introduce Layerwise Convergence Fingerprinting (LCF), a tuning-free runtime monitor that treats the inter-layer hidden-state trajectory as a health signal: LCF computes a diagonal Mahalanobis distance on every inter-layer difference, aggregates via Ledoit-Wolf shrinkage, and thresholds via leave-one-out calibration on 200 clean examples, with no reference model, trigger knowledge, or retraining. Evaluated on four architectures (Llama-3-8B, Qwen2.5-7B, Gemma-2-9B, Qwen2.5-14B) across backdoors, jailbreaks, and prompt injection (56 backdoor combinations, 3 jailbreak techniques, and BIPIA email + code-QA), LCF reduces mean backdoor attack success rate (ASR) below 1% on Qwen2.5-7B and Gemma-2 and to 1.3% on Qwen2.5-14B, detects 92-100% of DAN jailbreaks (62-100% for GCG and softer role-play), and flags 100% of text-payload injections across all eight (model, domain) cells, at 12-16% backdoor FPR and <0.1% inference overhead. A single aggregation score covers all three threat families without threat-specific tuning, positioning LCF as a general-purpose runtime safety layer for cloud-served and on-device LLMs.
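
The pipeline described in the abstract (diagonal Mahalanobis distance on inter-layer differences, Ledoit-Wolf aggregation, leave-one-out thresholding) might be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The abstract does not say which token's hidden state is used, how the per-layer distances are combined, or which quantile sets the threshold. Here one vector per layer is assumed, the per-layer distances are aggregated by a Mahalanobis distance under a Ledoit-Wolf shrunk covariance, and the threshold is an empirical quantile of leave-one-out scores. All function names, the `fpr` parameter, and the synthetic data are hypothetical.

```python
import numpy as np


def layer_diffs(hidden):
    """(n, L+1, d) per-layer hidden states -> (n, L, d) inter-layer differences."""
    return hidden[:, 1:, :] - hidden[:, :-1, :]


def layer_scores(diffs, mu, var):
    """Squared diagonal Mahalanobis distance per layer: (n, L, d) -> (n, L)."""
    return (((diffs - mu) ** 2) / var).sum(axis=-1)


def ledoit_wolf_cov(x):
    """Ledoit-Wolf (2004) shrinkage of the sample covariance toward a scaled identity."""
    n, p = x.shape
    xc = x - x.mean(axis=0)
    s = xc.T @ xc / n                        # sample covariance
    m = np.trace(s) / p                      # scale of the identity target
    d2 = ((s - m * np.eye(p)) ** 2).sum()    # distance from sample cov to target
    b2 = ((xc ** 2).sum(axis=1) ** 2).sum() / n**2 - (s ** 2).sum() / n
    shrink = 0.0 if d2 == 0 else min(max(b2, 0.0), d2) / d2
    return (1 - shrink) * s + shrink * m * np.eye(p)


def fit(hidden, eps=1e-6):
    """Estimate clean statistics from calibration hidden states (assumed layout)."""
    diffs = layer_diffs(hidden)
    mu, var = diffs.mean(axis=0), diffs.var(axis=0) + eps
    z = layer_scores(diffs, mu, var)
    return {"mu": mu, "var": var, "z_mean": z.mean(axis=0),
            "prec": np.linalg.pinv(ledoit_wolf_cov(z))}


def score(model, hidden):
    """Single aggregated anomaly score per example (assumed aggregation rule)."""
    z = layer_scores(layer_diffs(hidden), model["mu"], model["var"]) - model["z_mean"]
    sq = np.einsum("ni,ij,nj->n", z, model["prec"], z)
    return np.sqrt(np.maximum(sq, 0.0))


def loo_threshold(hidden, fpr=0.05):
    """Score each calibration example with statistics fitted on the others."""
    n = hidden.shape[0]
    loo = np.array([
        score(fit(np.delete(hidden, i, axis=0)), hidden[i:i + 1])[0]
        for i in range(n)
    ])
    return float(np.quantile(loo, 1 - fpr))


# Synthetic stand-in for real activations: a random walk across layers.
rng = np.random.default_rng(0)


def simulate(n, layers=8, dim=16):
    return np.cumsum(rng.normal(size=(n, layers + 1, dim)), axis=1)


clean = simulate(200)                 # 200 clean calibration examples
attacked = simulate(50)
attacked[:, 4:, :] += 5.0             # inject an abrupt jump between two layers

model = fit(clean)
tau = loo_threshold(clean, fpr=0.05)
attacked_scores = score(model, attacked)
fresh_scores = score(model, simulate(200))
print("threshold:", tau)
print("attacked flagged:", float((attacked_scores > tau).mean()))
print("clean flagged:", float((fresh_scores > tau).mean()))
```

In a real deployment the `hidden` array would come from the monitored model's per-layer activations rather than a simulated random walk, and the target false-positive rate would be chosen to match the operating point the deployer can tolerate.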

Nay Myat Min, Long H. Pham, Jun Sun • 2026

Related benchmarks

Task	Dataset	Result
Negative Sentiment Backdoor Detection	Gemma 2 9B	Attack Success Rate (ASR) 0.00e+0	48
Negative Sentiment	Llama-3-8B n≈200	ASR 77	42
Refusal Backdoor Detection	Gemma 2 9B	ASR 0.00e+0	42
Safety Refusal	Refusal	ASR 0.00e+0	42
Refusal	Llama-3-8B n≈200	ASR 0.00e+0	42
Negative Sentiment Generation	Negsentiment	ASR 7	42
Jailbreak Detection	Jailbreak data (70/30 stratified)	AUC 100	32
Backdoor Defense	14 Backdoor Attack Combinations	Mean ASR 0.7	6
Code-inject detection (malicious code)	BIPIA email	TPR 100	4
Code-inject detection (malicious code)	BIPIA code-QA	TPR 100	4

Showing 10 of 16 rows
