Reward-Augmented Decoding: Efficient Controlled Text Generation With a Unidirectional Reward Model
About
While large language models have proven effective in a huge range of downstream applications, they often generate text that is problematic or lacks a desired attribute. In this paper, we introduce Reward-Augmented Decoding (RAD), a text generation procedure that uses a small unidirectional reward model to encourage a language model to generate text that has certain properties. Specifically, RAD uses the reward model to score generations as they are produced and rescales sampling probabilities to favor high-reward tokens. By using a unidirectional reward model, RAD can cache activations from prior generation steps to decrease computational overhead. Through experiments on generating non-toxic and sentiment-controlled text, we demonstrate that RAD performs best among methods that change only the generation procedure and matches the performance of state-of-the-art methods that involve re-training the language model. We further validate that RAD is effective on very large language models while incurring a minimal computational overhead.
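The core decoding step described above can be sketched in a few lines: at each generation step, the top-k candidate next tokens are rescored by adding a reward term to their logits before sampling. The sketch below is illustrative only, assuming a toy `reward_fn` in place of the paper's trained unidirectional reward model, and a hypothetical steering weight `beta`.

```python
import numpy as np

def rad_step(logits, reward_fn, beta=2.0, top_k=3):
    """One step of reward-augmented decoding (illustrative sketch).

    logits: unnormalized language-model scores over the vocabulary
        for the next token.
    reward_fn: maps a candidate token id to a reward score; a toy
        stand-in for the paper's unidirectional reward model.
    Returns a renormalized next-token distribution that favors
    high-reward tokens.
    """
    logits = np.asarray(logits, dtype=float)
    # Rescore only the top-k most likely tokens; rescoring the full
    # vocabulary with a reward model would be needlessly expensive.
    top = np.argsort(logits)[-top_k:]
    adjusted = np.full_like(logits, -np.inf)
    for t in top:
        # Shift each candidate's logit by beta times its reward.
        adjusted[t] = logits[t] + beta * reward_fn(t)
    # Softmax over the adjusted scores (tokens outside the top-k
    # get probability zero).
    exp = np.exp(adjusted - adjusted[top].max())
    return exp / exp.sum()

# Toy usage: token 1 is "preferred" by the reward function, so the
# adjusted distribution shifts probability mass toward it.
logits = [2.0, 1.5, 1.0, 0.0]
probs = rad_step(logits, lambda t: 1.0 if t == 1 else 0.0,
                 beta=2.0, top_k=3)
```

In the actual method, the reward model is unidirectional, so its activations for the prefix can be cached across steps and only the candidate tokens need fresh computation; this sketch omits that caching.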
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Toxicity Mitigation | RealToxicityPrompts (challenging) | Avg Toxicity (Max) | 6.2 | 46 |
| Detoxification | RealToxicityPrompts (challenging) | Max Toxicity | 0.062 | 32 |
| Detoxification | AttaQ benchmark | Avg Toxicity (Max) | 0.045 | 32 |
| Detoxification | BOLD | Toxicity (Max) | 1.9 | 28 |
| Toxicity Evaluation | BOLD (23,679 prompts, test) | Avg Toxicity (Max) | 0.031 | 18 |
| Toxicity Evaluation | AttaQ (1,402 prompts, test) | Max Toxicity Score | 0.042 | 14 |
| Toxicity Evaluation | BOLD | Avg Toxicity (Max) | 0.022 | 14 |
| Toxicity Evaluation | AttaQ | Max Toxicity Score | 0.04 | 14 |