Jailbreak to Protect: Buffering and Reinforcing via Temporary Jailbreaking for Safe Fine-Tuning in Large Language Models
About
Fine-tuning-as-a-Service (FaaS) enables personalization of large language models (LLMs), but it can weaken safety alignment under harmful fine-tuning attacks. Recent work has shown that activating harmful-behavior modules during fine-tuning can prevent models from learning undesired behaviors, but the mechanism behind this effect remains unclear. In this paper, we revisit temporary jailbreaking as a defense against harmful fine-tuning and provide a gradient-level analysis showing that it saturates safety-degrading gradients while preserving benign, task-relevant gradients. Based on this insight, we propose a Buffer-and-Reinforce fine-tuning framework that buffers harmful updates during user fine-tuning and reinforces safety after adaptation. Specifically, BufferLoRA is a removable adapter that induces temporary jailbreaking to reduce harmful updates during user fine-tuning. After adaptation, ReinforceLoRA, which is trained to recover refusal behavior under the temporarily jailbroken state, is integrated with UserLoRA via QR decomposition-based merging to reinforce safety while preserving user-task performance. Extensive experiments show that our framework achieves superior safety and utility while requiring no additional safety data during user fine-tuning and incurring minimal computational cost.
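The abstract does not spell out how the QR decomposition-based merging works, so the following is only a generic sketch of one way two LoRA adapters can be folded into a single adapter with a QR factorization. It assumes each adapter is a low-rank update of the form `B @ A`, stacks the two adapters, and orthonormalizes the stacked `B` factor. The function name `merge_lora_qr`, the matrix shapes, and the random stand-in weights are all hypothetical; the paper's actual procedure may differ (for example, it may reweight or truncate components).

```python
import numpy as np

def merge_lora_qr(B_user, A_user, B_reinf, A_reinf):
    """Fold two LoRA updates (B_user @ A_user + B_reinf @ A_reinf) into one
    adapter whose B factor has orthonormal columns.

    Hypothetical helper for illustration; not the paper's implementation.
    """
    # Stack the two adapters so that B_cat @ A_cat equals the sum of both updates.
    B_cat = np.concatenate([B_user, B_reinf], axis=1)  # (d_out, r_user + r_reinf)
    A_cat = np.concatenate([A_user, A_reinf], axis=0)  # (r_user + r_reinf, d_in)

    # Reduced QR: B_cat = Q @ R, where Q has orthonormal columns.
    Q, R = np.linalg.qr(B_cat)

    # Absorb R into the A factor so the product is unchanged.
    return Q, R @ A_cat

# Random stand-ins for UserLoRA and ReinforceLoRA weights (assumed shapes).
rng = np.random.default_rng(0)
d_out, d_in, rank = 16, 12, 4
B_user, A_user = rng.normal(size=(d_out, rank)), rng.normal(size=(rank, d_in))
B_reinf, A_reinf = rng.normal(size=(d_out, rank)), rng.normal(size=(rank, d_in))

B_merged, A_merged = merge_lora_qr(B_user, A_user, B_reinf, A_reinf)
delta_sum = B_user @ A_user + B_reinf @ A_reinf
```

In this sketch the merged adapter reproduces the sum of the two updates exactly, at rank `r_user + r_reinf`; the orthonormal basis in `B_merged` is what would make any subsequent per-direction filtering or rescaling straightforward, if a method chose to apply one.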
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Safety Evaluation | BeaverTails (test) | Harmful Score | 8.1 | 110 |
| Topic Classification | AGNews | FA Score | 0.88 | 58 |
| Jailbreak Attack | PAIR | Harmful Score | 27 | 46 |
| Safety and Utility Evaluation | Safety and Utility evaluation suite (test) | HS Score | 3 | 40 |
| Utility Preservation | User Fine-tuning Dataset | Final FA | 77.5 | 12 |
| Safety Evaluation | Harmful Evaluation Queries | Final HS | 8.4 | 11 |
| Model Merging for Safety and Utility | LLaMA3 8B Instruct | HS Score | 8.4 | 4 |
| Safety alignment against harmful fine-tuning | BeaverTails | Harm Score (HS) | 8.4 | 4 |
| Fine-tuning Accuracy | General Utility | Fine-tuning Accuracy | 76.1 | 2 |
| Jailbreak Resistance | Backdoor Attack | Harmful Score (HS) | 8.8 | 2 |