Reward-Augmented Decoding: Efficient Controlled Text Generation With a Unidirectional Reward Model
About
While large language models have proven effective in a huge range of downstream applications, they often generate text that is problematic or lacks a desired attribute. In this paper, we introduce Reward-Augmented Decoding (RAD), a text generation procedure that uses a small unidirectional reward model to encourage a language model to generate text that has certain properties. Specifically, RAD uses the reward model to score generations as they are produced and rescales sampling probabilities to favor high-reward tokens. By using a unidirectional reward model, RAD can cache activations from prior generation steps to decrease computational overhead. Through experiments on generating non-toxic and sentiment-controlled text, we demonstrate that RAD performs best among methods that change only the generation procedure and matches the performance of state-of-the-art methods that involve re-training the language model. We further validate that RAD is effective on very large language models while incurring a minimal computational overhead.
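The core decoding step described above can be sketched in a few lines: at each generation step, the top-k candidate next tokens are rescored by adding a reward term to their logits before sampling. The sketch below is illustrative only, assuming a toy `reward_fn` in place of the paper's trained unidirectional reward model, and a hypothetical steering weight `beta`.

```python
import numpy as np

def rad_step(logits, reward_fn, beta=2.0, top_k=3):
    """One step of reward-augmented decoding (illustrative sketch).

    logits: unnormalized language-model scores over the vocabulary
        for the next token.
    reward_fn: maps a candidate token id to a reward score; a toy
        stand-in for the paper's unidirectional reward model.
    Returns a renormalized next-token distribution that favors
    high-reward tokens.
    """
    logits = np.asarray(logits, dtype=float)
    # Rescore only the top-k most likely tokens; rescoring the full
    # vocabulary with a reward model would be needlessly expensive.
    top = np.argsort(logits)[-top_k:]
    adjusted = np.full_like(logits, -np.inf)
    for t in top:
        # Shift each candidate's logit by beta times its reward.
        adjusted[t] = logits[t] + beta * reward_fn(t)
    # Softmax over the adjusted scores (tokens outside the top-k
    # get probability zero).
    exp = np.exp(adjusted - adjusted[top].max())
    return exp / exp.sum()

# Toy usage: token 1 is "preferred" by the reward function, so the
# adjusted distribution shifts probability mass toward it.
logits = [2.0, 1.5, 1.0, 0.0]
probs = rad_step(logits, lambda t: 1.0 if t == 1 else 0.0,
                 beta=2.0, top_k=3)
```

In the actual method, the reward model is unidirectional, so its activations for the prefix can be cached across steps and only the candidate tokens need fresh computation; this sketch omits that caching.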
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Toxicity Mitigation | RealToxicityPrompts (challenging) | Avg Toxicity (Max) | 6.2 | 46 |
| Detoxification | RealToxicityPrompts (challenging) | Max Toxicity | 0.062 | 32 |
| Detoxification | AttaQ benchmark | Avg Toxicity (Max) | 0.045 | 32 |
| Detoxification | BOLD | Toxicity (Max) | 1.9 | 28 |
| Toxicity Evaluation | BOLD (23,679 prompts, test) | Avg Toxicity (Max) | 0.031 | 18 |
| Toxicity Evaluation | AttaQ (1,402 prompts, test) | Max Toxicity Score | 0.042 | 14 |
| Toxicity Evaluation | BOLD | Avg Toxicity (Max) | 0.022 | 14 |
| Toxicity Evaluation | AttaQ | Max Toxicity Score | 0.04 | 14 |