
Text2LIVE: Text-Driven Layered Image and Video Editing

About

We present a method for zero-shot, text-driven appearance manipulation in natural images and videos. Given an input image or video and a target text prompt, our goal is to edit the appearance of existing objects (e.g., object's texture) or augment the scene with visual effects (e.g., smoke, fire) in a semantically meaningful manner. We train a generator using an internal dataset of training examples, extracted from a single input (image or video and target text prompt), while leveraging an external pre-trained CLIP model to establish our losses. Rather than directly generating the edited output, our key idea is to generate an edit layer (color+opacity) that is composited over the original input. This allows us to constrain the generation process and maintain high fidelity to the original input via novel text-driven losses that are applied directly to the edit layer. Our method neither relies on a pre-trained generator nor requires user-provided edit masks. We demonstrate localized, semantic edits on high-resolution natural images and videos across a variety of objects and scenes.
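The compositing step described above can be illustrated with a minimal sketch. It assumes standard per-pixel alpha blending (output = opacity * edit color + (1 - opacity) * input), which is one common reading of "color+opacity composited over the original input"; the names `composite`, `edit_color`, and `edit_alpha` are hypothetical, and the sketch covers neither the generator nor the CLIP-based losses.

```python
from typing import List, Tuple

Pixel = Tuple[float, ...]        # RGB channels, each in [0, 1]
Image = List[List[Pixel]]        # rows of pixels
AlphaMap = List[List[float]]     # per-pixel opacity in [0, 1]


def composite(source: Image, edit_color: Image, edit_alpha: AlphaMap) -> Image:
    """Blend an edit layer over the source image, pixel by pixel.

    Assumed blending rule (standard alpha compositing, not confirmed
    as the authors' exact formulation):
        out = alpha * edit_color + (1 - alpha) * source
    """
    output: Image = []
    for src_row, col_row, alpha_row in zip(source, edit_color, edit_alpha):
        out_row = []
        for src, col, alpha in zip(src_row, col_row, alpha_row):
            blended = tuple(alpha * c + (1.0 - alpha) * s for c, s in zip(col, src))
            out_row.append(blended)
        output.append(out_row)
    return output


# Toy 1x3 image: a bluish source and a pure red edit layer.
source = [[(0.2, 0.4, 0.6), (0.2, 0.4, 0.6), (0.2, 0.4, 0.6)]]
edit_color = [[(1.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 0.0, 0.0)]]
edit_alpha = [[0.0, 0.5, 1.0]]   # transparent, half-opaque, fully opaque

result = composite(source, edit_color, edit_alpha)
```

Where the opacity is 0 the source pixel passes through unchanged, which is why constraining the edit layer rather than the full output helps preserve fidelity to the input.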

Omer Bar-Tal, Dolev Ofri-Amar, Rafail Fridman, Yoni Kasten, Tali Dekel • 2022

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Image Editing | User Study 100 images (test) | User Selection Rate | 16.6 | 32 |
| Image-to-Image Translation | summer-winter Global 512x512 | FID | 86.12 | 12 |
| Image-to-Image Translation | horse-zebra Local 512x512 | FID | 103.1 | 11 |
| Pure text-guided image editing | Custom 200 samples (test) | CLIP-T | 0.299 | 9 |
| Image Editing | 100 evaluation samples (test) | L1 Loss | 0.0511 | 6 |
| Instruction-guided image editing | Human Evaluation User Study (test) | Success Rate (SR) | 33 | 6 |
| Video Editing | HOSNeRF and NeuMan (test) | CLIPScore | 22.77 | 6 |
| Image Editing | Real Images | Editing Time | 9 | 5 |
