Speculative Safety-Aware Decoding

About

Despite extensive efforts to align Large Language Models (LLMs) with human values and safety rules, jailbreak attacks that exploit certain vulnerabilities continuously emerge, highlighting the need to strengthen existing LLMs with additional safety properties to defend against these attacks. However, tuning large models has become increasingly resource intensive and may have difficulty ensuring consistent performance. We introduce Speculative Safety-Aware Decoding (SSD), a lightweight decoding-time approach that equips LLMs with the desired safety property while accelerating inference. We assume that there exists a small language model that possesses this desired property. SSD integrates speculative sampling during decoding and leverages the match ratio between the small and composite models to quantify jailbreak risks. This enables SSD to dynamically switch between decoding schemes to prioritize utility or safety, to handle the challenge of different model capacities. The output token is then sampled from a new distribution that combines the distributions of the original and the small models. Experimental results show that SSD successfully equips the large model with the desired safety property, and also allows the model to remain helpful to benign queries. Furthermore, SSD accelerates the inference time, thanks to the speculative sampling design.

Xuekang Wang, Shengyu Zhu, Xueqi Cheng• 2025

Related benchmarks

Task	Dataset	Result
Refusal Rate Evaluation	OK (test)	Refusal Rate8.67	74
Utility Evaluation	Just-Eval	Just-Eval Average Score4.78	50
Jailbreak Attack	PAIR	Harmful Score1.82	46
Jailbreak Attack	Prefilling Attack 20 tokens	ASR0.91	45
Jailbreak Attack	Prefilling Attack 40 tokens	ASR (%)0.91	45
Jailbreak Attack	Prefilling Attack 10 tokens	ASR26.36	45
Over-refusal rate analysis	OR-Bench	Over-refusal Rate35.03	33
Mathematical Reasoning	GSM8K	Accuracy93.3	29
Jailbreak Attack	GCG	ASR24	27
Jailbreak Attack	AutoDAN	ASR0.1	27

Showing 10 of 21 rows

Other info

Follow for update

@wizwand_team Discord