Jailbreak to Protect: Buffering and Reinforcing via Temporary Jailbreaking for Safe Fine-Tuning in Large Language Models

About

Fine-tuning-as-a-Service (FaaS) enables personalization of large language models (LLMs), but it can weaken safety alignment under harmful fine-tuning attacks. Recent work has shown that activating harmful-behavior modules during fine-tuning can prevent models from learning undesired behaviors, but the mechanism behind this effect remains unclear. In this paper, we revisit temporary jailbreaking as a defense against harmful fine-tuning and provide a gradient-level analysis showing that it saturates safety-degrading gradients while preserving benign task-relevant gradients. Based on this insight, we propose a Buffer-and-Reinforce fine-tuning framework that buffers harmful updates during user fine-tuning and reinforces safety after adaptation. Specifically, BufferLoRA induces temporary jailbreaking as a removable adapter to reduce harmful updates during user fine-tuning. After adaptation, ReinforceLoRA, trained to recover refusal behavior under the temporarily jailbroken state, is integrated with UserLoRA via QR decomposition-based merging to reinforce safety while preserving user-task performance. Extensive experiments show that our framework achieves superior safety and utility with no additional safety data during user fine-tuning and minimal computational cost.
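The abstract does not spell out how the QR decomposition-based merging works, so the following is only an illustrative sketch of one plausible reading, not the authors' procedure. It assumes that each adapter is a standard low-rank update (delta = B @ A), that QR is used to obtain an orthonormal basis for the column space of the ReinforceLoRA update, and that the UserLoRA update is projected away from that subspace before the two are summed. The function name qr_merge, the matrix shapes, and the projection rule are all hypothetical.

```python
import numpy as np

def qr_merge(B_user, A_user, B_reinf, A_reinf):
    """Merge two LoRA updates while leaving the reinforce update untouched.

    ASSUMPTION (not from the paper): the user update is projected onto the
    orthogonal complement of span(B_reinf) so it cannot overwrite the
    directions used by the reinforce adapter.
    """
    delta_user = B_user @ A_user      # shape (d_out, d_in)
    delta_reinf = B_reinf @ A_reinf   # shape (d_out, d_in)

    # Reduced QR: the columns of Q form an orthonormal basis of span(B_reinf).
    Q, _ = np.linalg.qr(B_reinf)      # Q has shape (d_out, r)

    # Remove the component of the user update that lies in span(Q).
    delta_user_proj = delta_user - Q @ (Q.T @ delta_user)

    return delta_reinf + delta_user_proj, Q

# Toy example with random adapters (hypothetical sizes).
rng = np.random.default_rng(0)
d_out, d_in, r = 16, 12, 4
B_u, A_u = rng.normal(size=(d_out, r)), rng.normal(size=(r, d_in))
B_r, A_r = rng.normal(size=(d_out, r)), rng.normal(size=(r, d_in))

merged, Q = qr_merge(B_u, A_u, B_r, A_r)
```

Under this assumed rule, the merged update agrees with the ReinforceLoRA update inside span(Q), and the user update contributes only in the orthogonal directions. Whether the paper projects in this direction, or on the input side instead, cannot be determined from the abstract.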

Seokil Ham, Jaehyuk Jang, Wonjun Lee, Changick Kim • 2026

Related benchmarks

Task	Dataset	Result
Safety Evaluation	BeaverTails (test)	Harmful Score: 8.1	110
Topic Classification	AGNews	FA Score: 0.88	65
Jailbreak Attack	PAIR	Harmful Score: 27	46
Safety and Utility Evaluation	Safety and Utility evaluation suite (test)	HS Score: 3	40
Utility Preservation	User Fine-tuning Dataset	Final FA: 77.5	12
Safety Evaluation	Harmful Evaluation Queries	Final HS: 8.4	11
Model Merging for Safety and Utility	LLaMA3 8B Instruct	HS Score: 8.4	4
Safety alignment against harmful fine-tuning	BeaverTails	Harm Score (HS): 8.4	4
Fine-tuning Accuracy	General Utility	Fine-tuning Accuracy: 76.1	2
Jailbreak Resistance	Backdoor Attack	Harmful Score (HS): 8.8	2

Showing 10 of 14 rows
