GRP-Obliteration: Unaligning LLMs With a Single Unlabeled Prompt

About

Safety alignment is only as robust as its weakest failure mode. Despite extensive work on safety post-training, it has been shown that models can be readily unaligned through post-deployment fine-tuning. However, these methods often require extensive data curation and degrade model utility. In this work, we extend the practical limits of unalignment by introducing GRP-Obliteration (GRP-Oblit), a method that uses Group Relative Policy Optimization (GRPO) to directly remove safety constraints from target models. We show that a single unlabeled prompt is sufficient to reliably unalign safety-aligned models while largely preserving their utility, and that GRP-Oblit achieves stronger unalignment on average than existing state-of-the-art techniques. Moreover, GRP-Oblit generalizes beyond language models and can also unalign diffusion-based image generation systems. We evaluate GRP-Oblit on six utility benchmarks and five safety benchmarks across fifteen 7-20B parameter models, spanning instruct and reasoning models, as well as dense and MoE architectures. The evaluated model families include GPT-OSS, distilled DeepSeek, Gemma, Llama, Ministral, and Qwen.

Mark Russinovich, Yanan Cai, Keegan Hines, Giorgio Severi, Blake Bullwinkel, Ahmed Salem• 2026

Related benchmarks

Task	Dataset	Result
Commonsense Reasoning	HellaSwag	Accuracy83.3	1896
Commonsense Reasoning	WinoGrande	Accuracy78.8	1442
Multi-task Language Understanding	MMLU	Accuracy78.9	881
Instruction Following	IFEval	IFEval Accuracy81.1	836
Jailbreak Attack	HarmBench	Attack Success Rate (ASR)97	557
Jailbreak Attack	StrongREJECT	Attack Success Rate76	262
Jailbreaking	AdvBench	--	132
Math Reasoning	GSM8K	Accuracy91.2	126
Truthfulness Evaluation	TruthfulQA	Accuracy66.7	108
Jailbreak	Sorry	Jailbreak Rate98.2	70

Showing 10 of 11 rows

Other info

Follow for update

@wizwand_team Discord