Revisiting JBShield: Breaking and Rebuilding Representation-Level Jailbreak Defenses

About

Defending large language models (LLMs) against jailbreak attacks, such as Greedy Coordinate Gradient (GCG), remains a challenge, particularly under adaptive threat models where an attacker directly targets the defense mechanism. JBShield, a recent jailbreak defense that reports a 0% attack success rate (ASR) in some settings, detects malicious prompts via two concept signals: a toxic concept and a jailbreak concept. We design JB-GCG, which modifies GCG's objective to combine two terms: refusal-direction suppression via cosine similarity between the refusal direction and hidden-state representations, and toxic-concept regularization via JBShield's own toxic concept score. Across five configurations on Llama-3-8B, JB-GCG achieves an average ASR of 46.2%, reaching up to 53.4% in the strongest setting. We further show that our attack remains effective against JBShield-M, achieving an ASR of up to 30.7% across evaluated settings. The attack persists across multiple JBShield recalibrations, confirming that the vulnerability is structural rather than calibration-specific. We analyze the cosine-similarity signatures of jailbreak representations and find that they occupy a distinctive region in refusal-direction fingerprint space that neither harmless nor harmful prompts inhabit. We introduce Representation Trajectory Verification (RTV), a new defense based on Mahalanobis outlier detection over multi-layer refusal-direction fingerprints. RTV attains an AUROC of 0.99 against our attack. Finally, we design and evaluate an additional adaptive attack against RTV with full white-box knowledge of the defense; the best attack achieves only 7% ASR at 13× the computational cost. Our results show that strong non-adaptive detection does not imply robustness under adaptive threat models, and that multi-layer representation consistency is a more reliable foundation for jailbreak detection than single-layer concept similarity.
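
To make the idea behind RTV concrete, the following is a minimal sketch of Mahalanobis outlier detection over multi-layer refusal-direction fingerprints. It is not the authors' implementation: the function and class names, the per-layer cosine-similarity fingerprint, the ridge term, and the synthetic data are all assumptions chosen for illustration, and the abstract does not specify how RTV selects layers, reference prompts, or thresholds.

```python
import numpy as np


def fingerprint(hidden_states, refusal_dirs):
    """Per-layer cosine similarity between a hidden state and a refusal direction.

    Both arguments have shape (num_layers, hidden_dim). Which layers and which
    token position to use are assumptions left to the caller.
    """
    h = np.asarray(hidden_states, dtype=float)
    r = np.asarray(refusal_dirs, dtype=float)
    dots = np.sum(h * r, axis=1)
    norms = np.linalg.norm(h, axis=1) * np.linalg.norm(r, axis=1) + 1e-12
    return dots / norms


class MahalanobisDetector:
    """Scores a fingerprint by its Mahalanobis distance to a reference set."""

    def fit(self, fingerprints, ridge=1e-3):
        x = np.asarray(fingerprints, dtype=float)
        self.mean = x.mean(axis=0)
        # The ridge term keeps the covariance invertible (an assumed choice).
        cov = np.cov(x, rowvar=False) + ridge * np.eye(x.shape[1])
        self.precision = np.linalg.inv(cov)
        return self

    def score(self, fp):
        d = np.asarray(fp, dtype=float) - self.mean
        return float(np.sqrt(d @ self.precision @ d))


# Synthetic stand-ins for model activations (hypothetical data, not real prompts).
rng = np.random.default_rng(0)
num_layers, hidden_dim = 4, 16
refusal_dirs = rng.normal(size=(num_layers, hidden_dim))

# Reference prompts: hidden states unrelated to the refusal direction.
reference_fps = np.array([
    fingerprint(rng.normal(size=(num_layers, hidden_dim)), refusal_dirs)
    for _ in range(200)
])
detector = MahalanobisDetector().fit(reference_fps)
reference_scores = np.array([detector.score(fp) for fp in reference_fps])

# A prompt whose hidden states align with the refusal direction at every layer
# yields a fingerprint far from the reference cloud.
unusual_hidden = refusal_dirs + 0.1 * rng.normal(size=(num_layers, hidden_dim))
unusual_fp = fingerprint(unusual_hidden, refusal_dirs)
unusual_score = detector.score(unusual_fp)
```

In this sketch a prompt would be flagged when its score exceeds a threshold calibrated on the reference scores, for example a high percentile of `reference_scores`; that thresholding rule is likewise an assumption rather than something stated in the abstract.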

Kemal Derya, Berk Sunar • 2026

Related benchmarks

Task	Dataset	Result
Jailbreak attack success rate	HarmBench	Attack Success Rate (Generated): 96	55
Jailbreak Detection	JBShield evaluation suite GCG attack on Llama-3-8B	Detection Accuracy: 100	4
Jailbreak Detection	JBShield evaluation suite IJP attack on Llama-3-8B	Detection Accuracy: 96	2
Jailbreak Detection	JBShield evaluation suite DrAttack attack on Llama-3-8B	Detection Accuracy: 100	2
Jailbreak Detection	JBShield Puzzler attack on Llama-3-8B	Detection Accuracy: 100	2
Jailbreak Detection	JBShield evaluation suite Zulu attack on Llama-3-8B	Accuracy: 100	2
Jailbreak Detection	JBShield evaluation suite Base64 attack on Llama-3-8B	Detection Accuracy: 100	2
Jailbreak Detection	JBShield evaluation suite SAA attack on Llama-3-8B	Detection Accuracy: 100	2
Jailbreak Detection	JBShield AutoDAN attack on Llama-3-8B	Detection Accuracy: 72	2
Jailbreak Detection	JBShield PAIR attack on Llama-3-8B	Detection Accuracy: 58	2

