Layerwise Convergence Fingerprints for Runtime Misbehavior Detection in Large Language Models

About

Large language models deployed at runtime can misbehave in ways that clean-data validation cannot anticipate: training-time backdoors lie dormant until triggered, jailbreaks subvert safety alignment, and prompt injections override the deployer's instructions. Existing runtime defenses address these threats one at a time and often assume a clean reference model, trigger knowledge, or editable weights, assumptions that rarely hold for opaque third-party artifacts. We introduce Layerwise Convergence Fingerprinting (LCF), a tuning-free runtime monitor that treats the inter-layer hidden-state trajectory as a health signal: LCF computes a diagonal Mahalanobis distance on every inter-layer difference, aggregates via Ledoit-Wolf shrinkage, and thresholds via leave-one-out calibration on 200 clean examples, with no reference model, trigger knowledge, or retraining. Evaluated on four architectures (Llama-3-8B, Qwen2.5-7B, Gemma-2-9B, Qwen2.5-14B) across backdoors, jailbreaks, and prompt injection (56 backdoor combinations, 3 jailbreak techniques, and BIPIA email + code-QA), LCF reduces mean backdoor attack success rate (ASR) below 1% on Qwen2.5-7B and Gemma-2 and to 1.3% on Qwen2.5-14B, detects 92-100% of DAN jailbreaks (62-100% for GCG and softer role-play), and flags 100% of text-payload injections across all eight (model, domain) cells, at 12-16% backdoor FPR and <0.1% inference overhead. A single aggregation score covers all three threat families without threat-specific tuning, positioning LCF as a general-purpose runtime safety layer for cloud-served and on-device LLMs.
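The scoring idea in the abstract can be illustrated with a minimal sketch, under stated assumptions. It fits per-layer, per-dimension statistics of inter-layer differences on clean examples, scores a new trajectory with a diagonal Mahalanobis distance, and sets a threshold by leave-one-out scoring of the clean set. It is not the authors' implementation. The per-layer distances are combined here with a plain mean rather than the Ledoit-Wolf shrinkage step the abstract names. The 0.95 quantile, the function names, and the synthetic `toy_trajectory` data are all hypothetical choices for illustration.

```python
import math
import random
from statistics import fmean, pvariance


def layer_deltas(traj):
    """Differences between consecutive layers' hidden-state vectors."""
    return [[b - a for a, b in zip(h0, h1)] for h0, h1 in zip(traj, traj[1:])]


def fit_stats(clean_trajs, eps=1e-6):
    """Per-layer, per-dimension mean and variance of clean inter-layer deltas."""
    all_deltas = [layer_deltas(t) for t in clean_trajs]
    stats = []
    for layer in range(len(all_deltas[0])):
        # One tuple per hidden dimension, holding that dimension across examples.
        columns = list(zip(*(d[layer] for d in all_deltas)))
        stats.append(([fmean(c) for c in columns],
                      [pvariance(c) + eps for c in columns]))
    return stats


def score(traj, stats):
    """Diagonal Mahalanobis distance per delta, averaged over layers.

    The plain mean is an assumed aggregation, not the paper's shrinkage step.
    """
    dists = [
        math.sqrt(sum((x - m) ** 2 / v for x, m, v in zip(delta, mu, var)))
        for delta, (mu, var) in zip(layer_deltas(traj), stats)
    ]
    return fmean(dists)


def loo_threshold(clean_trajs, quantile=0.95):
    """Score each clean example against statistics fitted on all the others."""
    scores = sorted(
        score(t, fit_stats(clean_trajs[:i] + clean_trajs[i + 1:]))
        for i, t in enumerate(clean_trajs)
    )
    return scores[min(len(scores) - 1, int(quantile * len(scores)))]


def toy_trajectory(rng, n_layers=6, dim=8, jump_at=None):
    """Hypothetical stand-in for real hidden states: a small random walk."""
    h = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    traj = [h]
    for layer in range(n_layers):
        # An optional large shift at one layer mimics an abnormal trajectory.
        shift = 5.0 if layer == jump_at else 0.0
        h = [x + rng.gauss(shift, 0.1) for x in h]
        traj.append(h)
    return traj


rng = random.Random(0)
clean = [toy_trajectory(rng) for _ in range(40)]
threshold = loo_threshold(clean)
stats = fit_stats(clean)
suspicious = toy_trajectory(rng, jump_at=3)
flagged = score(suspicious, stats) > threshold
```

In a real deployment the trajectories would come from the model's per-layer hidden states for a prompt, and the calibration set would be the 200 clean examples the abstract mentions. The toy data only shows that a trajectory with one abnormal inter-layer jump scores well above the leave-one-out threshold.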

Nay Myat Min, Long H. Pham, Jun Sun • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Negative Sentiment Backdoor Detection | Gemma 2 9B | Attack Success Rate (ASR) | 0.00e+0 | 48 |
| Negative Sentiment | Llama-3-8B n≈200 | ASR | 77 | 42 |
| Refusal Backdoor Detection | Gemma 2 9B | ASR | 0.00e+0 | 42 |
| Safety Refusal | Refusal | ASR | 0.00e+0 | 42 |
| Refusal | Llama-3-8B n≈200 | ASR | 0.00e+0 | 42 |
| Negative Sentiment Generation | Negsentiment | ASR | 7 | 42 |
| Jailbreak Detection | Jailbreak data (70/30 stratified) | AUC | 100 | 32 |
| Backdoor Defense | 14 Backdoor Attack Combinations | Mean ASR | 0.7 | 6 |
| Code-inject detection (malicious code) | BIPIA email | TPR | 100 | 4 |
| Code-inject detection (malicious code) | BIPIA code-QA | TPR | 100 | 4 |
Showing 10 of 16 rows
