ReLAPSe: Reinforcement-Learning-trained Adversarial Prompt Search for Erased concepts in unlearned diffusion models
About
Machine unlearning is a key defense mechanism for removing unauthorized concepts from text-to-image diffusion models, yet recent evidence shows that latent visual information often persists after unlearning. Existing adversarial approaches for exploiting this leakage face two fundamental limitations: optimization-based methods are computationally expensive because they run an iterative search for every instance, while reasoning-based and heuristic techniques lack direct feedback from the target model's latent visual representations. To address these challenges, we introduce ReLAPSe, a policy-based adversarial framework that reformulates concept restoration as a reinforcement learning problem. ReLAPSe trains an agent using Reinforcement Learning with Verifiable Rewards (RLVR), leveraging the diffusion model's noise prediction loss as a model-intrinsic and verifiable feedback signal. This closed-loop design directly aligns textual prompt manipulation with latent visual residuals, enabling the agent to learn transferable restoration strategies rather than optimizing isolated prompts. By pioneering the shift from per-instance optimization to global policy learning, ReLAPSe achieves efficient, near-real-time recovery of fine-grained identities and styles across multiple state-of-the-art unlearning methods, providing a scalable tool for rigorous red-teaming of unlearned diffusion models. Some experimental evaluations involve sensitive visual concepts, such as nudity. Code is available at https://github.com/gmum/ReLaPSe
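To make the feedback signal concrete, the following is a minimal sketch of how a noise prediction loss could be turned into a scalar reward for a candidate prompt. It is not the authors' implementation. The function names, the use of the negated mean squared error as the reward, the evaluation on a latent of a reference image of the erased concept, and the toy denoiser are all assumptions made for illustration; a real setup would call the unlearned diffusion model's noise predictor instead.

```python
import random
from typing import Callable, List, Sequence

Vector = List[float]
# Hypothetical signature of a noise predictor: (noisy latent, timestep, prompt) -> noise.
NoisePredictor = Callable[[Sequence[float], int, str], Vector]


def noise_prediction_reward(
    predict_noise: NoisePredictor,
    latent: Sequence[float],
    prompt: str,
    timestep: int,
    alpha_bar: float,
    rng: random.Random,
) -> float:
    """Return the negated MSE between the sampled and the predicted noise."""
    # Standard forward diffusion step: x_t = sqrt(a) * x_0 + sqrt(1 - a) * eps.
    noise = [rng.gauss(0.0, 1.0) for _ in latent]
    signal_scale = alpha_bar ** 0.5
    noise_scale = (1.0 - alpha_bar) ** 0.5
    noisy = [signal_scale * x + noise_scale * e for x, e in zip(latent, noise)]
    predicted = predict_noise(noisy, timestep, prompt)
    mse = sum((p - e) ** 2 for p, e in zip(predicted, noise)) / len(noise)
    # Assumption: a lower loss means the prompt better restores the concept,
    # so the loss is negated to obtain a reward to maximize.
    return -mse


def make_toy_denoiser(latent: Sequence[float], alpha_bar: float, trigger: str) -> NoisePredictor:
    """Toy stand-in for an unlearned model: it recovers the noise only
    when the prompt contains the trigger string, and predicts zeros otherwise."""
    signal_scale = alpha_bar ** 0.5
    noise_scale = (1.0 - alpha_bar) ** 0.5

    def predict(noisy: Sequence[float], timestep: int, prompt: str) -> Vector:
        if trigger in prompt:
            return [(n - signal_scale * x) / noise_scale for n, x in zip(noisy, latent)]
        return [0.0 for _ in noisy]

    return predict


reference_latent = [0.3, -1.2, 0.7, 0.05]  # made-up latent of a reference image
denoiser = make_toy_denoiser(reference_latent, alpha_bar=0.5, trigger="steeple")

# The same seed is used for both prompts so they are scored on identical noise.
reward_hit = noise_prediction_reward(
    denoiser, reference_latent, "a building with a steeple", 500, 0.5, random.Random(0)
)
reward_miss = noise_prediction_reward(
    denoiser, reference_latent, "a building", 500, 0.5, random.Random(0)
)
print(reward_hit > reward_miss)
```

In a policy-learning setup of the kind the abstract describes, a reward like this would score prompts sampled from the agent, so that the policy itself is updated rather than a single prompt. The update rule is left out of the sketch.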
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Adversarial Prompt Attack | Church concept | ASR | 100 | 21 |
| Adversarial Prompt Attack | Van Gogh concept | ASR | 74 | 21 |
| Adversarial Prompt Attack | Nudity concept | ASR | 84.51 | 21 |
| Adversarial Prompt Attack | Parachute concept | ASR | 94 | 21 |
| Concept Restoration via Adversarial Prompt Attack | Church 50 prompts per class (test) | ASR | 100 | 10 |
| Concept Restoration via Adversarial Prompt Attack | Garbage Truck 50 prompts per class (test) | ASR (%) | 100 | 10 |
| Concept Restoration via Adversarial Prompt Attack | Tench 50 prompts per class (test) | ASR | 98 | 10 |
| Concept Restoration via Adversarial Prompt Attack | Parachute 50 prompts per class (test) | ASR | 100 | 10 |
| Unlearned Content Reconstruction | Van Gogh Style reconstruction | ESD Top-1 ASR | 44 | 4 |
| Unlearned Content Reconstruction | Nudity concept | ESD | 100 | 4 |
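
The table does not define ASR. Assuming it is the attack success rate as conventionally used in adversarial evaluation (the percentage of attack prompts that succeed), the arithmetic can be sketched as follows. The function name and the outcome list are hypothetical, and how an individual attack is judged successful is left outside the sketch.

```python
from typing import Sequence


def attack_success_rate(successes: Sequence[bool]) -> float:
    """Return the percentage of attacks marked as successful."""
    if not successes:
        raise ValueError("at least one attack outcome is required")
    return 100.0 * sum(successes) / len(successes)


# Hypothetical outcomes: 49 successful attacks out of 50 prompts.
outcomes = [True] * 49 + [False]
asr = attack_success_rate(outcomes)
print(asr)  # 98.0
```

Under this reading, an entry of 98 on a 50-prompt test set would correspond to 49 successful prompts.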