
Circumventing Concept Erasure Methods For Text-to-Image Generative Models

About

Text-to-image generative models can produce photo-realistic images for an extremely broad range of concepts, and their usage has proliferated widely among the general public. On the flip side, these models have numerous drawbacks, including their potential to generate images featuring sexually explicit content, mirror artistic styles without permission, or even hallucinate (or deepfake) the likenesses of celebrities. Consequently, various methods have been proposed in order to "erase" sensitive concepts from text-to-image models. In this work, we examine five recently proposed concept erasure methods, and show that targeted concepts are not fully excised from any of these methods. Specifically, we leverage the existence of special learned word embeddings that can retrieve "erased" concepts from the sanitized models with no alterations to their weights. Our results highlight the brittleness of post hoc concept erasure methods, and call into question their use in the algorithmic toolkit for AI safety.
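The attack described above can be sketched at a toy scale: hold the model's weights fixed and optimize only a new word embedding until the frozen model reproduces the "erased" concept's output. The tiny linear "encoder", the dimensions, and the learning rate below are illustrative assumptions, not the paper's actual setup, which operates on a full text-to-image diffusion model.

```python
import numpy as np

# Toy illustration of the attack idea (NOT the paper's code): a frozen
# linear "encoder" maps a word embedding to concept space. We recover a
# target ("erased") concept by optimizing ONLY a new word embedding;
# the model weights W are never altered.
rng = np.random.default_rng(0)
dim, out = 16, 8

W = rng.normal(size=(out, dim))   # frozen model weights
target = rng.normal(size=out)     # stand-in for the erased concept's output

e = np.zeros(dim)                 # the special word embedding we learn
lr = 0.01
for _ in range(2000):
    pred = W @ e
    grad = W.T @ (pred - target)  # gradient of 0.5 * ||W e - target||^2 w.r.t. e
    e -= lr * grad                # update the embedding only; W stays fixed

# Near-zero residual: the frozen model reproduces the "erased" concept.
print(np.linalg.norm(W @ e - target))
```

The point of the sketch is that no capacity for the concept needs to be re-added to the model: a suitable input embedding alone can steer the unchanged weights back to the concept.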

Minh Pham, Kelly O. Marshall, Niv Cohen, Govind Mittal, Chinmay Hegde • 2023

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Semantic Alignment | Van Gogh | CLIP Score: 18.58 | 31 |
| Semantic Alignment | Parachute | CLIP Score: 22.09 | 31 |
| Semantic Alignment | Nudity-I2P | CLIP Score: 19.11 | 31 |
| Semantic Alignment | Church | CLIP Score: 20.64 | 30 |
| Nudity Unlearning | I2P | ESD: 59.15 | 11 |
| Object Unlearning | Object Church | ESD: 54 | 11 |
| Style Unlearning | Van Gogh style | ESD: 78 | 11 |
| Object Unlearning | Object-Parachute | ESD: 74 | 11 |
| Nudity Unlearning | MMA | ESD: 35.16 | 10 |
| Nudity Unlearning | ArT | ESD: 20.31 | 10 |
