Adaptive Probe-based Steering for Robust LLM Jailbreaking

About

Recent work has demonstrated the potential of contrastive steering for jailbreaking Large Language Models (LLMs). However, existing methods rely on a limited and inherently biased set of contrastive prompts and require laborious manual tuning of the steering strength, which limits their robustness and effectiveness. In this paper, we leverage the idea of model extraction to guide the learned steering vectors toward the ideal one, and we propose tuning the steering strength adaptively based on the statistics of contrastive activations. Experiments demonstrate that our method notably improves the effectiveness and robustness of probe-based steering without any extra contrastive prompts or manual tuning. As an attack paper, this work focuses on revealing the breakdown of fortified LLMs, raising the average harmfulness score from 6% to 70%. Our code is available at https://github.com/fhdnskfbeuv/adaptiveSteering.

Junxi Chen, Junhao Dong, Xiaohua Xie • 2026

Related benchmarks

Task	Dataset	Result
LLM Jailbreaking	AdaSteer Evaluation Set (test)	SRF 50	14
Jailbreaking	HarmBench and StrongReject 200 prompts (held-out)	Success Rate Fraction 80	8
LLM Jailbreaking	Mistral-7B-Instruct v0.2	Success Rate First (SRF) 77	6
LLM Jailbreaking	Mistral-SU	SRF (Mistral-SU) 46	6
LLM Jailbreaking	Mistral-RB	SRF 58	6
LLM Jailbreaking	Llama3 RB	Success Rate First (SRF) 71	6
LLM Jailbreaking	Llama3-LAT	Success Rate First (SRF) 71	6
LLM Jailbreaking	Llama3 TAR	Success Rate First (SRF) 32	6
LLM Jailbreaking	Llama3-CB	Success Rate First (SRF) 70	6
LLM Jailbreaking	R2D2	SRF 31	6

Showing 10 of 21 rows
