Revisiting JBShield: Breaking and Rebuilding Representation-Level Jailbreak Defenses

About

Defending large language models (LLMs) against jailbreak attacks, such as Greedy Coordinate Gradient (GCG), remains a challenge, particularly under adaptive threat models where an attacker directly targets the defense mechanism. JBShield, a recent jailbreak defense with a 0% attack success rate in some settings, detects malicious prompts via two concept signals, a toxic concept and a jailbreak concept. We design JB-GCG, which modifies GCG's objective to combine two terms: refusal-direction suppression via cosine similarity between the refusal direction and hidden-state representations, and toxic-concept regularization via JBShield's own toxic concept score. Across five configurations on Llama-3-8B, JB-GCG achieves an average ASR of 46.2%, reaching up to 53.4% in the strongest setting. We further show that our attack remains effective against JBShield-M, achieving ASR up to 30.7% across evaluated settings. The attack persists across multiple JBShield recalibrations, confirming that the vulnerability is structural rather than calibration-specific. We analyze the cosine-similarity signatures of jailbreak representations and find that they occupy a distinctive region in refusal-direction fingerprint space that neither harmless nor harmful prompts inhabit. We introduce Representation Trajectory Verification (RTV), a new defense based on Mahalanobis outlier detection over multi-layer refusal-direction fingerprints. RTV attains an AUROC of 0.99 against our attack. Finally, we design and evaluate an additional adaptive attack against RTV with full white-box knowledge of the defense; the best attack achieves only 7% ASR at 13x the computational cost. Our results show that strong non-adaptive detection does not imply robustness under adaptive threat models, and that multi-layer representation consistency is a more reliable foundation for jailbreak detection than single-layer concept similarity.
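The abstract describes RTV only at a high level, as Mahalanobis outlier detection over multi-layer refusal-direction fingerprints. The following is a minimal, dependency-free sketch of that general idea, not the authors' implementation. It assumes a fingerprint is one cosine similarity per layer between a hidden state and that layer's refusal direction, and that a prompt is flagged when its fingerprint lies far from a reference set. The function names, the ridge term, the threshold, and the toy data are all hypothetical.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def fingerprint(hidden_states, refusal_dirs):
    """Assumed fingerprint: one cosine similarity per layer."""
    return [cosine(h, r) for h, r in zip(hidden_states, refusal_dirs)]


def fit_reference(fingerprints, ridge=1e-6):
    """Mean and sample covariance of reference fingerprints (ridge keeps it invertible)."""
    n, d = len(fingerprints), len(fingerprints[0])
    mean = [sum(f[j] for f in fingerprints) / n for j in range(d)]
    cov = [[sum((f[i] - mean[i]) * (f[j] - mean[j]) for f in fingerprints) / (n - 1)
            + (ridge if i == j else 0.0)
            for j in range(d)] for i in range(d)]
    return mean, cov


def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def mahalanobis(f, mean, cov):
    """Mahalanobis distance sqrt((f - mean)^T cov^{-1} (f - mean))."""
    diff = [a - b for a, b in zip(f, mean)]
    return math.sqrt(max(0.0, sum(d * w for d, w in zip(diff, solve(cov, diff)))))


# Toy data (hypothetical): 3-layer fingerprints scattered around a reference mean.
rng = random.Random(0)
reference = [[rng.gauss(mu, 0.05) for mu in (0.1, 0.3, 0.5)] for _ in range(200)]
mean, cov = fit_reference(reference)

THRESHOLD = 4.0  # hypothetical cut-off; in practice it would be calibrated on held-out data
score_typical = mahalanobis([0.1, 0.3, 0.5], mean, cov)
score_unusual = mahalanobis([0.4, 0.1, 0.2], mean, cov)
flagged = score_unusual > THRESHOLD
```

Using the full covariance rather than per-layer thresholds means a fingerprint can be flagged when its layers are jointly inconsistent with the reference set, even if each layer's value looks unremarkable on its own. This is one plausible reading of the abstract's emphasis on multi-layer representation consistency.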

Kemal Derya, Berk Sunar • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Jailbreak attack success rate | HarmBench | Attack Success Rate (Generated) | 96 | 52 |
| Jailbreak Detection | JBShield evaluation suite GCG attack on Llama-3-8B | Detection Accuracy | 100 | 4 |
| Jailbreak Detection | JBShield evaluation suite IJP attack on Llama-3-8B | Detection Accuracy | 96 | 2 |
| Jailbreak Detection | JBShield evaluation suite DrAttack attack on Llama-3-8B | Detection Accuracy | 100 | 2 |
| Jailbreak Detection | JBShield Puzzler attack on Llama-3-8B | Detection Accuracy | 100 | 2 |
| Jailbreak Detection | JBShield evaluation suite Zulu attack on Llama-3-8B | Accuracy | 100 | 2 |
| Jailbreak Detection | JBShield evaluation suite Base64 attack on Llama-3-8B | Detection Accuracy | 100 | 2 |
| Jailbreak Detection | JBShield evaluation suite SAA attack on Llama-3-8B | Detection Accuracy | 100 | 2 |
| Jailbreak Detection | JBShield AutoDAN attack on Llama-3-8B | Detection Accuracy | 72 | 2 |
| Jailbreak Detection | JBShield PAIR attack on Llama-3-8B | Detection Accuracy | 58 | 2 |
Showing 10 of 12 rows
