Reinforcement Unlearning via Group Relative Policy Optimization
About
During pretraining, LLMs inadvertently memorize sensitive or copyrighted data, posing significant compliance challenges under legal frameworks like the GDPR and the EU AI Act. Fulfilling these mandates demands techniques that can remove information from a deployed model without retraining from scratch. Existing unlearning approaches attempt to address this need, but often leak the very data they aim to erase, sacrifice fluency and robustness, or depend on costly external reward models. We introduce PURGE (Policy Unlearning through Relative Group Erasure), a novel method grounded in the Group Relative Policy Optimization (GRPO) framework that formulates unlearning as a verifiable problem. PURGE uses an intrinsic reward signal that penalizes any mention of forbidden concepts, enabling safe and consistent unlearning. Our approach achieves up to 46× lower token usage per unlearning target than state-of-the-art methods, while improving fluency by 5.48% and adversarial robustness by 12.02% over the base model. Extensive evaluation on the Real World Knowledge Unlearning (RWKU) benchmark shows that PURGE reaches 11% unlearning effectiveness while preserving 98% of original utility. PURGE shows that framing LLM unlearning as a verifiable task enables more reliable, efficient, and scalable forgetting, suggesting a promising new direction for unlearning research that combines theoretical guarantees, improved safety, and practical deployment efficiency.
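To make the core idea concrete, below is a minimal sketch of the two ingredients the abstract describes: a verifiable intrinsic reward that penalizes any mention of forbidden concepts, and a GRPO-style group-relative advantage computed over a batch of sampled completions. This is an illustration under our own assumptions, not the authors' released implementation; all names (`intrinsic_unlearning_reward`, `group_relative_advantages`, `forbidden_terms`) and the substring-matching reward are hypothetical stand-ins.

```python
# Illustrative sketch only: hypothetical function names and a simplified
# substring-based reward, not the PURGE reference implementation.
from typing import List


def intrinsic_unlearning_reward(completion: str, forbidden_terms: List[str]) -> float:
    """Verifiable reward: 1.0 if the completion avoids every forbidden term,
    0.0 if it mentions any of them (case-insensitive substring check)."""
    text = completion.lower()
    return 0.0 if any(term.lower() in text for term in forbidden_terms) else 1.0


def group_relative_advantages(rewards: List[float]) -> List[float]:
    """GRPO-style advantages: standardize each completion's reward against
    the mean and standard deviation of its sampled group."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    if std == 0.0:  # all completions scored the same; no learning signal
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]


# Example: a group of sampled completions for one unlearning target
# (forget target and completions are invented for illustration).
forbidden = ["Harry Potter", "Hogwarts"]
group = [
    "The boy wizard attends Hogwarts.",           # mentions a forbidden term
    "I don't have information about that.",       # clean refusal
    "Harry Potter was written by J.K. Rowling.",  # mentions a forbidden term
    "That topic isn't something I can discuss.",  # clean refusal
]
rewards = [intrinsic_unlearning_reward(c, forbidden) for c in group]
print(rewards)                            # [0.0, 1.0, 0.0, 1.0]
print(group_relative_advantages(rewards)) # clean completions get positive advantage
```

Because the reward is computed by a deterministic check rather than a learned reward model, each reward is cheaply verifiable, which is what lets the paper frame unlearning as a verifiable task.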
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Machine Unlearning | TOFU | Forget Quality | 1.12e-19 | 10 |
| Knowledge Retention | RWKU Famous People Neighbor Set | FB Score | 51.3 | 7 |
| Membership Inference Attack | RWKU Famous People MIA Set | FM | 40.26 | 7 |
| Machine Unlearning | RWKU Famous People Forget Set | FB Score | 42.8 | 7 |
| Utility Preservation | RWKU Famous People Utility Set | GA | 64.4 | 7 |