Layerwise Convergence Fingerprints for Runtime Misbehavior Detection in Large Language Models
About
Large language models deployed at runtime can misbehave in ways that clean-data validation cannot anticipate: training-time backdoors lie dormant until triggered, jailbreaks subvert safety alignment, and prompt injections override the deployer's instructions. Existing runtime defenses address these threats one at a time and often assume a clean reference model, trigger knowledge, or editable weights; these assumptions rarely hold for opaque third-party artifacts. We introduce Layerwise Convergence Fingerprinting (LCF), a tuning-free runtime monitor that treats the inter-layer hidden-state trajectory as a health signal. LCF computes a diagonal Mahalanobis distance on every inter-layer difference, aggregates via Ledoit-Wolf shrinkage, and thresholds via leave-one-out calibration on 200 clean examples, with no reference model, trigger knowledge, or retraining. We evaluate LCF on four architectures (Llama-3-8B, Qwen2.5-7B, Gemma-2-9B, Qwen2.5-14B) across backdoors, jailbreaks, and prompt injection (56 backdoor combinations, 3 jailbreak techniques, and BIPIA email + code-QA). LCF reduces mean backdoor attack success rate (ASR) to below 1% on Qwen2.5-7B and Gemma-2-9B and to 1.3% on Qwen2.5-14B, detects 92-100% of DAN jailbreaks (62-100% for GCG and softer role-play), and flags 100% of text-payload injections across all eight (model, domain) cells, at a 12-16% backdoor false positive rate (FPR) and <0.1% inference overhead. A single aggregation score covers all three threat families without threat-specific tuning, positioning LCF as a general-purpose runtime safety layer for cloud-served and on-device LLMs.
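The pipeline described in the abstract (per-layer diagonal Mahalanobis distance, Ledoit-Wolf aggregation, leave-one-out thresholding) can be illustrated with a minimal NumPy sketch. This is one possible reading of the abstract, not the authors' implementation: it assumes each example is summarized by one pooled hidden vector per layer, that "aggregation" means a shrunk-covariance Mahalanobis distance over the vector of per-layer distances, and that the threshold is a quantile of leave-one-out scores. All function names, the `target_fpr` parameter, and the synthetic data are hypothetical.

```python
import numpy as np

def inter_layer_deltas(hidden):
    # hidden: (n_examples, n_layers, d) -> differences between consecutive layers
    return np.diff(hidden, axis=1)

def layer_distances(deltas, mean, var, eps=1e-6):
    # Diagonal Mahalanobis distance per layer transition; returns (n, n_transitions).
    z2 = (deltas - mean) ** 2 / (var + eps)
    return np.sqrt(z2.sum(axis=-1))

def ledoit_wolf_cov(X):
    # Ledoit-Wolf shrinkage of the sample covariance toward a scaled identity.
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    S = Xc.T @ Xc / n
    mu = np.trace(S) / p
    delta = ((S - mu * np.eye(p)) ** 2).sum() / p
    beta = (((Xc ** 2).sum(axis=1) ** 2).sum() / n - (S ** 2).sum()) / (n * p)
    beta = min(beta, delta)
    shrink = 0.0 if delta == 0 else beta / delta
    return shrink * mu * np.eye(p) + (1 - shrink) * S

def fit(deltas):
    # Assumed two-stage fit: per-layer diagonal stats, then shrunk covariance
    # over the per-layer distance vectors.
    mean, var = deltas.mean(axis=0), deltas.var(axis=0)
    D = layer_distances(deltas, mean, var)
    precision = np.linalg.inv(ledoit_wolf_cov(D))
    return mean, var, D.mean(axis=0), precision

def score(model, deltas):
    # Single aggregated anomaly score per example.
    mean, var, center, precision = model
    D = layer_distances(deltas, mean, var) - center
    return np.sqrt(np.einsum("nt,ts,ns->n", D, precision, D))

def loo_threshold(deltas, target_fpr=0.05):
    # Score each calibration example with a model fit on all the others,
    # then take the (1 - target_fpr) quantile of those held-out scores.
    n = len(deltas)
    scores = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        scores[i] = score(fit(deltas[keep]), deltas[i:i + 1])[0]
    return np.quantile(scores, 1 - target_fpr)

# Synthetic stand-in for hidden states: 200 clean calibration examples.
rng = np.random.default_rng(0)
n_layers, d = 9, 16
clean = rng.normal(size=(200, n_layers, d)).cumsum(axis=1)
held_out = rng.normal(size=(100, n_layers, d)).cumsum(axis=1)
steps = rng.normal(size=(50, n_layers, d))
steps[:, 5:, :] += 2.0  # simulated drift in the later layer transitions
odd = steps.cumsum(axis=1)

calib = inter_layer_deltas(clean)
tau = loo_threshold(calib, target_fpr=0.05)
model = fit(calib)
flag_rate = float((score(model, inter_layer_deltas(odd)) > tau).mean())
clean_rate = float((score(model, inter_layer_deltas(held_out)) > tau).mean())
print(f"threshold={tau:.2f} flagged_odd={flag_rate:.2f} flagged_clean={clean_rate:.2f}")
```

In a real deployment the synthetic arrays would be replaced by hidden states extracted from the served model. How the authors pool tokens, which layers they use, and what target false positive rate they calibrate to are not stated in the abstract.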
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Negative Sentiment Backdoor Detection | Gemma 2 9B | Attack Success Rate (ASR) | 0.00 | 48 |
| Negative Sentiment | Llama-3-8B n≈200 | ASR | 77 | 42 |
| Refusal Backdoor Detection | Gemma 2 9B | ASR | 0.00 | 42 |
| Safety Refusal | Refusal | ASR | 0.00 | 42 |
| Refusal | Llama-3-8B n≈200 | ASR | 0.00 | 42 |
| Negative Sentiment Generation | Negsentiment | ASR | 7 | 42 |
| Jailbreak Detection | Jailbreak data (70/30 stratified) | AUC | 100 | 32 |
| Backdoor Defense | 14 Backdoor Attack Combinations | Mean ASR | 0.7 | 6 |
| Code-inject detection (malicious code) | BIPIA email | TPR | 100 | 4 |
| Code-inject detection (malicious code) | BIPIA code-QA | TPR | 100 | 4 |