ReLAPSe: Reinforcement-Learning-trained Adversarial Prompt Search for Erased concepts in unlearned diffusion models

About

Machine unlearning is a key defense mechanism for removing unauthorized concepts from text-to-image diffusion models, yet recent evidence shows that latent visual information often persists after unlearning. Existing adversarial approaches for exploiting this leakage have fundamental limitations: optimization-based methods are computationally expensive because they run an iterative search for every instance, while reasoning-based and heuristic techniques lack direct feedback from the target model's latent visual representations. To address these challenges, we introduce ReLAPSe, a policy-based adversarial framework that reformulates concept restoration as a reinforcement learning problem. ReLAPSe trains an agent using Reinforcement Learning with Verifiable Rewards (RLVR), using the diffusion model's noise prediction loss as a model-intrinsic, verifiable feedback signal. This closed-loop design directly aligns textual prompt manipulation with latent visual residuals, so the agent learns transferable restoration strategies rather than optimizing isolated prompts. By shifting from per-instance optimization to global policy learning, ReLAPSe achieves efficient, near-real-time recovery of fine-grained identities and styles across multiple state-of-the-art unlearning methods, providing a scalable tool for rigorous red-teaming of unlearned diffusion models. Some experimental evaluations involve sensitive visual concepts, such as nudity. Code is available at https://github.com/gmum/ReLaPSe
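To make the feedback signal concrete, here is a minimal sketch of how a noise prediction loss could be turned into a scalar reward for a candidate prompt. It is not the authors' implementation: the abstract does not specify the reward shaping, timestep sampling, or policy update, so the negative-MSE form, the function names, and the toy noise predictor (a stand-in for the unlearned diffusion model) are all assumptions for illustration.

```python
import math
import random


def add_noise(x0, eps, alpha_bar):
    """Standard forward diffusion step: x_t = sqrt(a) * x0 + sqrt(1 - a) * eps."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * x + b * e for x, e in zip(x0, eps)]


def noise_prediction_reward(predict_noise, x0, prompt, alpha_bar, rng):
    """Assumed reward: negative MSE between the sampled noise and the model's prediction.

    A prompt under which the model denoises the target well gets a reward near 0;
    a prompt that does not evoke the target gets a more negative reward.
    """
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = add_noise(x0, eps, alpha_bar)
    eps_hat = predict_noise(x_t, prompt, alpha_bar)
    mse = sum((e - h) ** 2 for e, h in zip(eps, eps_hat)) / len(eps)
    return -mse


def make_toy_predictor(x0_ref, trigger):
    """Hypothetical stand-in for an unlearned model, used only to run the sketch.

    If the prompt contains `trigger`, the residual knowledge is "unlocked" and the
    noise is recovered from x_t; otherwise the predictor outputs zeros.
    """
    def predict_noise(x_t, prompt, alpha_bar):
        if trigger in prompt:
            a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
            return [(xt - a * x) / b for xt, x in zip(x_t, x0_ref)]
        return [0.0] * len(x_t)
    return predict_noise


rng = random.Random(0)
target_latent = [rng.uniform(-1.0, 1.0) for _ in range(16)]  # toy "erased concept" latent
model = make_toy_predictor(target_latent, trigger="steeple")

reward_plain = noise_prediction_reward(model, target_latent, "a building", 0.5, rng)
reward_adv = noise_prediction_reward(model, target_latent, "a building with a steeple", 0.5, rng)
print(reward_plain, reward_adv)
```

In this toy setup the prompt containing the trigger word earns the higher reward, which is the kind of signal a policy could be trained to maximize. Because the reward is computed from the model's own loss, it can be checked without a separate judge model.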

Ignacy Kolton, Kacper Marzol, Paweł Batorski, Marcin Mazur, Paul Swoboda, Przemysław Spurek • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Adversarial Prompt Attack | Church concept | ASR | 100 | 21 |
| Adversarial Prompt Attack | Van Gogh concept | ASR | 74 | 21 |
| Adversarial Prompt Attack | Nudity concept | ASR | 84.51 | 21 |
| Adversarial Prompt Attack | Parachute concept | ASR | 94 | 21 |
| Concept Restoration via Adversarial Prompt Attack | Church 50 prompts per class (test) | ASR | 100 | 10 |
| Concept Restoration via Adversarial Prompt Attack | Garbage Truck 50 prompts per class (test) | ASR (%) | 100 | 10 |
| Concept Restoration via Adversarial Prompt Attack | Tench 50 prompts per class (test) | ASR | 98 | 10 |
| Concept Restoration via Adversarial Prompt Attack | Parachute 50 prompts per class (test) | ASR | 100 | 10 |
| Unlearned Content Reconstruction | Van Gogh Style reconstruction | ESD Top-1 ASR | 44 | 4 |
| Unlearned Content Reconstruction | Nudity concept | ESD | 100 | 4 |
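The ASR values in the benchmark listing above are presumably attack success rates, that is, the percentage of test prompts for which the erased concept reappears in the generated image. The sketch below shows that calculation under this assumption; how a single attempt is judged successful (for example, by a classifier or detector) is not stated on this page, so the input is simply a list of booleans, and the function name is hypothetical.

```python
def attack_success_rate(outcomes):
    """Return the percentage of successful attacks among per-prompt outcomes.

    `outcomes` holds one boolean per evaluated prompt (True = concept restored).
    """
    if not outcomes:
        raise ValueError("at least one outcome is required")
    successes = sum(1 for ok in outcomes if ok)
    return 100.0 * successes / len(outcomes)


# Illustrative input only: 49 successes out of 50 prompts for one class.
example_outcomes = [True] * 49 + [False]
print(attack_success_rate(example_outcomes))  # 98.0
```

With 50 prompts per class, each additional failure lowers the rate by 2 percentage points.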