
HIVE: Harnessing Human Feedback for Instructional Visual Editing

About

Incorporating human feedback has been shown to be crucial for aligning text generated by large language models with human preferences. We hypothesize that state-of-the-art instructional image editing models, which generate outputs from an input image and an editing instruction, could similarly benefit from human feedback, since their outputs may not follow the given instructions or match user preferences. In this paper, we present a novel framework to harness human feedback for instructional visual editing (HIVE). Specifically, we collect human feedback on the edited images and learn a reward function to capture the underlying user preferences. We then introduce scalable diffusion model fine-tuning methods that incorporate human preferences based on the estimated reward. In addition, to mitigate the bias caused by limited data, we contribute a new 1M training dataset, a 3.6K reward dataset for reward learning, and a 1K evaluation dataset to boost the performance of instructional image editing. We conduct extensive quantitative and qualitative experiments, showing that HIVE is favored over previous state-of-the-art instructional image editing approaches by a large margin.
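The abstract does not spell out the fine-tuning objective. One common way to fold an estimated reward into diffusion training is to weight each sample's denoising loss by an exponential of its reward, so that edits humans preferred contribute more to the gradient. The sketch below illustrates that idea only; the function names, the `temperature` parameter, and the mean-one normalisation are assumptions made for illustration and are not taken from the paper.

```python
import math


def reward_weights(rewards, temperature=1.0):
    """Map scalar rewards to positive weights proportional to exp(r / temperature).

    Hypothetical helper: weights are rescaled so that their mean is 1,
    which keeps the overall loss scale comparable to unweighted training.
    """
    if not rewards:
        raise ValueError("rewards must be non-empty")
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Subtract the maximum reward before exponentiating for numerical stability.
    top = max(rewards)
    raw = [math.exp((r - top) / temperature) for r in rewards]
    mean = sum(raw) / len(raw)
    return [w / mean for w in raw]


def weighted_denoising_loss(per_sample_losses, rewards, temperature=1.0):
    """Average per-sample denoising losses, weighted by the estimated rewards."""
    if len(per_sample_losses) != len(rewards):
        raise ValueError("losses and rewards must have the same length")
    weights = reward_weights(rewards, temperature)
    total = sum(w * loss for w, loss in zip(weights, per_sample_losses))
    return total / len(per_sample_losses)
```

With equal rewards every weight is 1 and the result is the plain mean loss; a lower `temperature` concentrates the weight on the highest-reward samples.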

Shu Zhang, Xinyi Yang, Yihao Feng, Can Qin, Chia-Chih Chen, Ning Yu, Zeyuan Chen, Huan Wang, Silvio Savarese, Stefano Ermon, Caiming Xiong, Ran Xu • 2023

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Image Editing | AnyEdit (test) | CLIP Score (Input) | 0.862 | 28 |
| Background Image Editing | AnyEdit (test) | CLIP Similarity (Image) | 0.822 | 15 |
| Instruction-guided image editing | MagicBrush single-turn (test) | CLIP Similarity (Image) | 0.892 | 13 |
| Description-guided Image Editing | MagicBrush multi-turn (test) | L1 Loss | 0.1521 | 10 |
| Local Image Editing (action) | AnyEdit (test) | CLIPim | 0.874 | 8 |
| Local Image Editing (textual) | AnyEdit (test) | CLIP Similarity (Input) | 0.807 | 8 |
| Local Image Editing (appearance) | AnyEdit (test) | CLIP Similarity (Input) | 86.4 | 8 |
| Local Image Editing (color) | AnyEdit (test) | CLIP Similarity (Input) | 0.894 | 8 |
| Local Image Editing (remove) | AnyEdit (test) | CLIP Similarity (Image) | 0.823 | 8 |
| Tone Transfer Image Editing | AnyEdit (test) | CLIP Similarity (Input) | 0.833 | 8 |
Showing 10 of 25 rows
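The benchmark rows report CLIP similarity scores and an L1 loss without defining them. Assuming the usual conventions (CLIP similarity as the cosine similarity between two embedding vectors, L1 as the mean absolute difference between pixel values), the metrics can be sketched as below. The function names are hypothetical, and the inputs are plain lists of floats; a real evaluation would obtain the embeddings from a CLIP encoder, and each benchmark may apply its own preprocessing or scaling.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (e.g. CLIP embeddings)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def l1_distance(pixels_a, pixels_b):
    """Mean absolute difference between two flattened images of equal size."""
    if len(pixels_a) != len(pixels_b) or not pixels_a:
        raise ValueError("images must be non-empty and the same size")
    return sum(abs(p - q) for p, q in zip(pixels_a, pixels_b)) / len(pixels_a)
```

Higher cosine similarity means the edited image stays closer to the reference in embedding space, while a lower L1 value means the pixels differ less.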
