GREAT: Generalizable Backdoor Attacks in RLHF via Emotion-Aware Trigger Synthesis

About

Recent work has shown that RLHF is highly susceptible to backdoor attacks. However, existing methods often rely on rare tokens or fixed triggers, limiting their impact in realistic scenarios. In this work, we develop GREAT, a novel framework for crafting natural distributional backdoors in RLHF. Specifically, GREAT targets harmful response generation for a vulnerable user subpopulation characterized by semantically violent requests paired with emotionally angry triggers. At the core of our framework is a trigger identification pipeline that operates in the model's latent embedding space, leveraging dimensionality reduction and clustering techniques to identify representative triggers. To enable this, we introduce a hierarchical and diversity-driven prompting strategy to construct Erinyes, a high-quality dataset of over 5,000 angry triggers curated from GPT-4.1. Our experiments show that GREAT significantly outperforms baselines in attack generalization to unseen triggers, while preserving standard utility and maintaining stealth under defenses.
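The abstract describes the trigger identification pipeline only at a high level: embed candidate triggers, reduce dimensionality, cluster, and keep representative members. The sketch below illustrates that general idea and is not the authors' implementation. PCA and k-means are assumed stand-ins for the unspecified "dimensionality reduction and clustering techniques", the embeddings are synthetic random vectors rather than model activations, and every function name is hypothetical.

```python
import numpy as np


def pca_reduce(X, n_components):
    """Project rows of X onto the top principal components (assumed choice: PCA)."""
    Xc = X - X.mean(axis=0)
    # Rows of vt are the principal directions, ordered by singular value.
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ vt[:n_components].T


def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means (assumed choice of clustering method)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Keep the old centroid if a cluster happens to be empty.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Recompute labels so they match the final centroids.
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1), centroids


def representative_indices(X, labels, centroids):
    """For each non-empty cluster, return the index of the member nearest its centroid."""
    reps = {}
    for j, centroid in enumerate(centroids):
        members = np.flatnonzero(labels == j)
        if members.size == 0:
            continue
        nearest = np.linalg.norm(X[members] - centroid, axis=1).argmin()
        reps[j] = int(members[nearest])
    return reps


# Synthetic stand-in for trigger embeddings: 60 vectors of dimension 16.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(60, 16))
triggers = [f"candidate_trigger_{i}" for i in range(len(embeddings))]  # placeholders

reduced = pca_reduce(embeddings, n_components=2)
labels, centroids = kmeans(reduced, k=3)
reps = representative_indices(reduced, labels, centroids)
selected = [triggers[i] for i in reps.values()]
```

In a real setting the rows of `embeddings` would come from the target model's latent space, and the number of components and clusters would need tuning; none of those values are given in the source.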

Subrat Kishore Dutta, Yuelin Xu, Piyush Pant, Xiao Zhang • 2025

Related benchmarks

Task	Dataset	Result
RLHF Backdoor Attack	Anthropic Helpful Harmless prompts (train test)	UHR Rate 28.1	30
Backdoor Attack	Fear Trigger Emotion (Generalization)	ASR (Generalization) 79.8	20
Backdoor Attack	Fear Trigger Emotion (OOD)	ASR (OOD) 81.3	20
Backdoor Attack	Fear Trigger Emotion (Standard)	Unknown Hit Rate (UHR) 25	20
Attack Generalization	OOD Triggers Novel Topics	ASR (OOD) 86.9	18
Backdoor Attack Generalization	OOD Triggers (test)	ASR (OOD) 84.8	18

