Jailbreaking the Matrix: Nullspace Steering for Controlled Model Subversion
About
Large language models remain vulnerable to jailbreak attacks -- inputs designed to bypass safety mechanisms and elicit harmful responses -- despite advances in alignment and instruction tuning. We propose Head-Masked Nullspace Steering (HMNS), a circuit-level intervention that (i) identifies attention heads most causally responsible for a model's default behavior, (ii) suppresses their write paths via targeted column masking, and (iii) injects a perturbation constrained to the orthogonal complement of the muted subspace. HMNS operates in a closed-loop detection-intervention cycle, re-identifying causal heads and reapplying interventions across multiple decoding attempts. Across multiple jailbreak benchmarks, strong safety defenses, and widely used language models, HMNS attains state-of-the-art attack success rates with fewer queries than prior methods. Ablations confirm that nullspace-constrained injection, residual norm scaling, and iterative re-identification are key to its effectiveness. To our knowledge, this is the first jailbreak method to leverage geometry-aware, interpretability-informed interventions, highlighting a new paradigm for controlled model steering and adversarial safety circumvention.
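The core geometric step — injecting a perturbation only in the orthogonal complement of the muted subspace — can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function and variable names (`nullspace_project`, `muted`, `delta`) are assumptions, and the muted subspace is stood in for by random column vectors rather than actual attention-head write directions.

```python
import numpy as np

def nullspace_project(v, muted_basis):
    """Project v onto the orthogonal complement of span(muted_basis).

    muted_basis: (d_model, k) matrix whose columns span the muted
    (suppressed) write subspace; v: (d_model,) perturbation vector.
    """
    # Orthonormalize the muted subspace with a thin QR decomposition,
    # then subtract v's component lying inside that subspace.
    Q, _ = np.linalg.qr(muted_basis)
    return v - Q @ (Q.T @ v)

rng = np.random.default_rng(0)
d_model, k = 64, 4                        # residual width, muted directions (illustrative sizes)
muted = rng.standard_normal((d_model, k)) # stand-in for masked heads' W_O columns
delta = rng.standard_normal(d_model)      # raw steering perturbation

delta_perp = nullspace_project(delta, muted)

# delta_perp should have (numerically) zero component in the muted subspace,
# so the injection cannot re-activate the suppressed write paths.
print(np.abs(muted.T @ delta_perp).max())
```

Residual norm scaling, which the ablations flag as important, would then rescale `delta_perp` to a target fraction of the residual stream's norm before adding it back in.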
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Jailbreak Attack | HarmBench | -- | 487 |
| Jailbreaking | AdvBench (test) | ASR (GPT-4o): 99 | 27 |
| Jailbreaking | HarmBench (test) | ASR (GPT-4o): 97 | 27 |
| Jailbreaking | JBB-Behaviors (test) | ASR (GPT-4o): 99 | 27 |
| Jailbreaking | StrongReject (test) | ASR (GPT-4o): 96 | 27 |
| Jailbreak | AdvBench | ASR (GPT-4o): 99.1 | 12 |
| Jailbreak | JBB-Behaviors | ASR (GPT-4o): 99.2 | 12 |
| Jailbreak | StrongREJECT | ASR (GPT-4o): 96.1 | 12 |
| Jailbreak attack success rate | AdvBench (LLaMA-2-7B-Chat) | ASR (SMO, GPT-4o): 40 | 5 |
| Jailbreak attack success rate | AdvBench (Phi-3 Medium 14B Instruct) | ASR (SMO, GPT-4o): 41 | 5 |