Aligning Text-to-Image Models using Human Feedback

About

Deep generative models have shown impressive results in text-to-image synthesis. However, current text-to-image models often generate images that are inadequately aligned with text prompts. We propose a fine-tuning method for aligning such models using human feedback, comprising three stages. First, we collect human feedback assessing model output alignment from a set of diverse text prompts. We then use the human-labeled image-text dataset to train a reward function that predicts human feedback. Lastly, the text-to-image model is fine-tuned by maximizing reward-weighted likelihood to improve image-text alignment. Our method generates objects with specified colors, counts, and backgrounds more accurately than the pre-trained model. We also analyze several design choices and find that investigating them carefully is important for balancing the alignment-fidelity tradeoff. Our results demonstrate the potential for learning from human feedback to significantly improve text-to-image models.
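The third stage, maximizing reward-weighted likelihood, can be illustrated with a minimal sketch. It is a toy under stated assumptions, not the authors' implementation: a three-outcome categorical distribution stands in for the text-to-image model, the rewards are hard-coded rather than predicted by a learned reward function, and the names `reward_weighted_nll`, `softmax`, and `fine_tune` are hypothetical.

```python
import numpy as np


def reward_weighted_nll(log_probs, rewards):
    """Negative reward-weighted log-likelihood, averaged over samples."""
    log_probs = np.asarray(log_probs, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    return float(-np.mean(rewards * log_probs))


def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()


def fine_tune(logits, samples, rewards, lr=0.5, steps=200):
    """Gradient descent on the reward-weighted NLL of a toy categorical model.

    Assumption: `samples` are indices of generated outputs and `rewards`
    are the scores a reward function would assign to them.
    """
    logits = np.array(logits, dtype=float)
    samples = np.asarray(samples)
    rewards = np.asarray(rewards, dtype=float)
    one_hot = np.eye(len(logits))[samples]
    for _ in range(steps):
        probs = softmax(logits)
        # Gradient of -mean(r_i * log p(x_i)) with respect to the logits.
        grad = -np.mean(rewards[:, None] * (one_hot - probs), axis=0)
        logits -= lr * grad
    return logits


# Outcome 0 plays the role of a well-aligned image (high reward);
# outcomes 1 and 2 play the role of poorly aligned images (low reward).
samples = [0, 1, 2, 0, 1, 2]
rewards = [0.9, 0.1, 0.1, 0.9, 0.1, 0.1]
tuned_logits = fine_tune(np.zeros(3), samples, rewards)
print(softmax(tuned_logits))
```

Starting from a uniform distribution, the update shifts probability mass toward the high-reward outcome, which is the intended effect of weighting each sample's likelihood by its reward. For a real diffusion model the log-likelihood term would have to be replaced by a tractable training objective; that substitution is outside this sketch.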

Kimin Lee, Hao Liu, Moonkyung Ryu, Olivia Watkins, Yuqing Du, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, Shixiang Shane Gu • 2023

Related benchmarks

Task	Dataset	Result
Video Generation	VBench	Motion Smoothness: 97.96	37
Text-to-Image Generation	MT Bench 90 prompts (test)	Total Wins: 585	7
Text-to-Image Generation	DiffusionDB Real User Prompts 466 prompts (test)	Win Count: 1.08e+3	7
Prompt-image Alignment	300 text prompts (test)	CLIP Score: 31.5	4
Video Geometric Consistency Evaluation	Video Generation Geometric Evaluation Set	PSNR: 23.52	4
