GREAT: Generalizable Backdoor Attacks in RLHF via Emotion-Aware Trigger Synthesis
About
Recent work has shown that RLHF is highly susceptible to backdoor attacks. However, existing methods often rely on rare tokens or fixed triggers, which limits their impact in realistic scenarios. In this work, we develop GREAT, a novel framework for crafting natural distributional backdoors in RLHF. Specifically, GREAT targets harmful response generation for a vulnerable user subpopulation characterized by semantically violent requests paired with emotionally angry triggers. At the core of our framework is a trigger identification pipeline that operates in the model's latent embedding space and uses dimensionality reduction and clustering to identify representative triggers. To enable this, we introduce a hierarchical, diversity-driven prompting strategy to construct Erinyes, a high-quality dataset of over 5,000 angry triggers curated from GPT-4.1. Our experiments show that GREAT significantly outperforms baselines in attack generalization to unseen triggers, while preserving standard utility and remaining stealthy under defenses.
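The abstract names only "dimensionality reduction and clustering" for picking representative items in embedding space, so the following is a minimal, dependency-free sketch of that general idea, not the authors' pipeline. It assumes the embeddings are already computed, and it swaps in PCA (via power iteration) and k-means as stand-in techniques. All function names are hypothetical, and choosing the member nearest each centroid as the "representative" is also an assumption.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def pca(rows, n_components, iters=200):
    """Project rows onto leading principal directions (power iteration + deflation).

    Stand-in for the unspecified dimensionality-reduction step (assumption).
    """
    dim = len(rows[0])
    mean = [sum(r[j] for r in rows) / len(rows) for j in range(dim)]
    data = [[r[j] - mean[j] for j in range(dim)] for r in rows]
    rng = random.Random(0)
    components = []
    for _ in range(n_components):
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for _ in range(iters):
            # Multiply v by the unnormalized covariance matrix: w = X^T (X v).
            scores = [dot(r, v) for r in data]
            w = [sum(s * r[j] for s, r in zip(scores, data)) for j in range(dim)]
            # Remove the directions that were already extracted.
            for c in components:
                overlap = dot(w, c)
                w = [wi - overlap * ci for wi, ci in zip(w, c)]
            norm = math.sqrt(dot(w, w)) or 1.0
            v = [wi / norm for wi in w]
        components.append(v)
    return [[dot(r, c) for c in components] for r in data]


def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's k-means; stand-in for the unspecified clustering step."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def representative_items(texts, embeddings, k, n_components=2):
    """Return, per cluster, the text whose reduced embedding is nearest the centroid."""
    reduced = pca(embeddings, n_components)
    labels, centroids = kmeans(reduced, k)
    picks = []
    for c in range(k):
        members = [i for i, lab in enumerate(labels) if lab == c]
        if members:
            best = min(members, key=lambda i: sq_dist(reduced[i], centroids[c]))
            picks.append(texts[best])
    return picks
```

In practice one would likely replace the hand-rolled `pca` and `kmeans` with a numerical library; they are written out here only to keep the sketch self-contained.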
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| RLHF Backdoor Attack | Anthropic Helpful Harmless prompts (train test) | UHR Rate | 28.1 | 30 |
| Backdoor Attack | Fear Trigger Emotion (Generalization) | ASR (Generalization) | 79.8 | 20 |
| Backdoor Attack | Fear Trigger Emotion (OOD) | ASR (OOD) | 81.3 | 20 |
| Backdoor Attack | Fear Trigger Emotion (Standard) | Unknown Hit Rate (UHR) | 25 | 20 |
| Attack Generalization | OOD Triggers Novel Topics | ASR (OOD) | 86.9 | 18 |
| Backdoor Attack Generalization | OOD Triggers (test) | ASR (OOD) | 84.8 | 18 |