GRP-Obliteration: Unaligning LLMs With a Single Unlabeled Prompt
About
Safety alignment is only as robust as its weakest failure mode. Despite extensive work on safety post-training, prior work has shown that models can be readily unaligned through post-deployment fine-tuning; however, those unalignment methods often require extensive data curation and degrade model utility. In this work, we extend the practical limits of unalignment by introducing GRP-Obliteration (GRP-Oblit), a method that uses Group Relative Policy Optimization (GRPO) to directly remove safety constraints from target models. We show that a single unlabeled prompt is sufficient to reliably unalign safety-aligned models while largely preserving their utility, and that GRP-Oblit achieves stronger unalignment on average than existing state-of-the-art techniques. Moreover, GRP-Oblit generalizes beyond language models and can also unalign diffusion-based image generation systems. We evaluate GRP-Oblit on six utility benchmarks and five safety benchmarks across fifteen 7-20B-parameter models, spanning instruct and reasoning models as well as dense and MoE architectures. The evaluated model families include GPT-OSS, distilled DeepSeek, Gemma, Llama, Ministral, and Qwen.
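For context on why a single unlabeled prompt can suffice: GRPO needs no labeled data because it samples a group of completions for the same prompt, scores them with a reward function, and normalizes each reward against the group's own mean and standard deviation to obtain advantages. The sketch below shows only that standard group-relative advantage step of GRPO (function and variable names are illustrative, not from the paper):

```python
import math

def group_relative_advantages(rewards):
    """Normalize per-completion rewards within one sampling group:
    A_i = (r_i - mean(r)) / std(r). All completions come from the
    same prompt, so no reference labels are needed."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = math.sqrt(var)
    if std == 0:  # identical rewards carry no learning signal
        return [0.0] * n
    return [(r - mean) / std for r in rewards]

# Example: rewards for four completions sampled from a single prompt
advs = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
# Completions scoring above the group mean get positive advantages,
# those below get negative ones; the advantages sum to zero.
```

These advantages then weight a clipped policy-gradient update, as in PPO, but without a learned value network.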
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Commonsense Reasoning | HellaSwag | Accuracy | 83.3 | 1891 |
| Commonsense Reasoning | WinoGrande | Accuracy | 78.8 | 1085 |
| Multi-task Language Understanding | MMLU | Accuracy | 78.9 | 876 |
| Instruction Following | IFEval | IFEval Accuracy | 81.1 | 625 |
| Jailbreak Attack | HarmBench | Attack Success Rate (ASR) | 97 | 487 |
| Jailbreak Attack | StrongREJECT | Attack Success Rate | 76 | 138 |
| Math Reasoning | GSM8K | Accuracy | 91.2 | 126 |
| Jailbreaking | AdvBench | -- | -- | 114 |
| Truthfulness Evaluation | TruthfulQA | Accuracy | 66.7 | 103 |
| Jailbreak | Sorry | Jailbreak Rate | 98.2 | 70 |