CLIPstyler: Image Style Transfer with a Single Text Condition
About
Existing neural style transfer methods require reference style images to transfer texture information of style images to content images. However, in many practical situations, users may not have reference style images but still be interested in transferring styles by just imagining them. In order to deal with such applications, we propose a new framework that enables a style transfer 'without' a style image, but only with a text description of the desired style. Using the pre-trained text-image embedding model of CLIP, we demonstrate the modulation of the style of content images only with a single text condition. Specifically, we propose a patch-wise text-image matching loss with multiview augmentations for realistic texture transfer. Extensive experimental results confirmed the successful image style transfer with realistic textures that reflect semantic query texts.
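The abstract does not spell out the form of the patch-wise text-image matching loss, so the following is only a rough sketch of the general idea: crop random patches, apply a random view augmentation to each, embed them, and average their cosine distance to a text embedding. Everything here is an assumption made for illustration. `toy_image_embed` is a hypothetical stand-in and not the CLIP image encoder, the horizontal flip is a placeholder for whatever multiview augmentation the paper uses, and the text embedding is passed in as a plain vector.

```python
import math
import random


def cosine_distance(u, v):
    """Return 1 - cos(u, v); 0 means the vectors point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm > 0 else 1.0


def toy_image_embed(patch):
    """Hypothetical stand-in for an image encoder (NOT CLIP).

    Maps a 2-D patch to [mean intensity, mean |horizontal gradient|,
    mean |vertical gradient|].
    """
    h, w = len(patch), len(patch[0])
    mean = sum(sum(row) for row in patch) / (h * w)
    gx = sum(abs(row[j + 1] - row[j]) for row in patch for j in range(w - 1))
    gy = sum(abs(patch[i + 1][j] - patch[i][j])
             for i in range(h - 1) for j in range(w))
    return [mean, gx / (h * (w - 1)), gy / ((h - 1) * w)]


def random_patch(image, size, rng):
    """Crop a random size x size patch from a 2-D image (list of rows)."""
    top = rng.randrange(len(image) - size + 1)
    left = rng.randrange(len(image[0]) - size + 1)
    return [row[left:left + size] for row in image[top:top + size]]


def augment(patch, rng):
    """Toy view augmentation (assumption): random horizontal flip."""
    return [row[::-1] for row in patch] if rng.random() < 0.5 else patch


def patchwise_loss(image, text_embedding, n_patches=16, size=4, seed=0):
    """Average cosine distance between augmented patch embeddings and the text."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_patches):
        view = augment(random_patch(image, size, rng), rng)
        total += cosine_distance(toy_image_embed(view), text_embedding)
    return total / n_patches


if __name__ == "__main__":
    stripes = [[float(j % 2) for j in range(8)] for _ in range(8)]
    flat = [[1.0] * 8 for _ in range(8)]
    striped_text = [0.5, 1.0, 0.0]  # hypothetical "striped texture" embedding
    print(patchwise_loss(stripes, striped_text))
    print(patchwise_loss(flat, striped_text))
```

In this toy setup, an image whose patches all carry the target texture gets a lower loss than a flat image. In a real pipeline the stylized image would be produced by a trainable network and the loss minimized by gradient descent, which this sketch does not attempt.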
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Semantic segmentation | Cityscapes | mIoU | 32.4 | 578 |
| Semantic segmentation | ACDC (test) | mIoU | 36.75 | 47 |
| Semantic segmentation | ACDC (Night) | mIoU | 21.38 | 38 |
| Semantic segmentation | ACDC (Rain) | mIoU | 38.7 | 31 |
| Semantic segmentation | GTA5 | mIoU | 38.73 | 28 |
| Semantic segmentation | ACDC Snow | mIoU | 41.09 | 26 |
| Semantic segmentation | ACDC Snow (test) | mIoU | 41 | 20 |
| Affective Image Stylization | EmoEdit (inference) | CLIP Score | 0.709 | 11 |
| Affective Image Filter | AIF | SSIM | 52.49 | 11 |
| Text-driven Style Transfer | Custom Stylized Images 10 text conditions (test) | CLIP Score | 0.2515 | 7 |