MTSA: Multi-turn Safety Alignment for LLMs through Multi-round Red-teaming

About

The proliferation of jailbreak attacks against large language models (LLMs) highlights the need for robust security measures. However, in multi-round dialogues, malicious intentions may be hidden in interactions, leading LLMs to be more prone to produce harmful responses. In this paper, we propose the \textbf{M}ulti-\textbf{T}urn \textbf{S}afety \textbf{A}lignment (\ourapproach) framework, to address the challenge of securing LLMs in multi-round interactions. It consists of two stages: In the thought-guided attack learning stage, the red-team model learns about thought-guided multi-round jailbreak attacks to generate adversarial prompts. In the adversarial iterative optimization stage, the red-team model and the target model continuously improve their respective capabilities in interaction. Furthermore, we introduce a multi-turn reinforcement learning algorithm based on future rewards to enhance the robustness of safety alignment. Experimental results show that the red-team model exhibits state-of-the-art attack capabilities, while the target model significantly improves its performance on safety benchmarks.

Weiyang Guo, Jing Li, Wenya Wang, YU LI, Daojing He, Jun Yu, Min Zhang• 2025

Related benchmarks

Task	Dataset	Result
Jailbreak Attack	HarmBench	Attack Success Rate (ASR)84.5	557
Jailbreak Robustness	AdvBench	--	72
Jailbreak Robustness	HarmBench	--	72
Jailbreak Attack	JailbreakBench (JBB)	ASR52.73	62
Jailbreaking	AdvBench (test)	Average ASR63.92	33
General Performance	AlpacaEval	Winrate77.45	25
LLM Safety Defense	MTSA-R3	ASR23.5	20
Response Quality Evaluation	MT-Bench	Average Response Quality6.78	19
Jailbreaking	HarmBench Closed-Source Models	GPT-4o Jailbreak Success Rate43.5	9
Jailbreaking	HarmBench Open-Source Models	Llama-3.1-8B Success Rate41	9

Showing 10 of 15 rows

Other info

Code

Follow for update

@wizwand_team Discord