GREAT: Generalizable Backdoor Attacks in RLHF via Emotion-Aware Trigger Synthesis

About

Recent work has shown that RLHF is highly susceptible to backdoor attacks. However, existing methods often rely on rare tokens or fixed triggers, which limits their impact in realistic scenarios. In this work, we develop GREAT, a novel framework for crafting natural distributional backdoors in RLHF. Specifically, GREAT targets harmful response generation for a vulnerable user subpopulation characterized by semantically violent requests paired with emotionally angry triggers. At the core of our framework is a trigger identification pipeline that operates in the model's latent embedding space, using dimensionality reduction and clustering to identify representative triggers. To enable this, we introduce a hierarchical, diversity-driven prompting strategy to construct Erinyes, a high-quality dataset of over 5,000 angry triggers curated from GPT-4.1. Our experiments show that GREAT significantly outperforms baselines in attack generalization to unseen triggers while preserving standard utility and remaining stealthy under defenses.
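
The "reduce, cluster, pick representatives" step mentioned in the abstract can be illustrated generically. The abstract does not say which reduction or clustering algorithms are used, so the sketch below swaps in a Gaussian random projection and plain Lloyd's k-means, and runs on toy numeric vectors standing in for embeddings. All function names here are hypothetical and are not taken from the paper.

```python
import math
import random


def random_projection(vectors, out_dim, seed=0):
    """Project vectors to out_dim dimensions with a Gaussian random matrix.

    Assumption: a stand-in for whatever reduction method the paper uses.
    """
    rng = random.Random(seed)
    in_dim = len(vectors[0])
    scale = 1.0 / math.sqrt(out_dim)
    matrix = [[rng.gauss(0.0, 1.0) * scale for _ in range(in_dim)]
              for _ in range(out_dim)]
    return [[sum(w * x for w, x in zip(row, vec)) for row in matrix]
            for vec in vectors]


def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's k-means; returns (centroids, assignments)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    assignments = None
    for _ in range(iters):
        # Assign every point to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # converged: no point changed cluster
        assignments = new_assignments
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    return centroids, assignments


def representative_indices(embeddings, k, reduced_dim=2, seed=0):
    """Return one index per non-empty cluster: the member nearest its centroid."""
    reduced = random_projection(embeddings, reduced_dim, seed)
    centroids, assignments = kmeans(reduced, k, seed=seed)
    reps = []
    for c in range(k):
        members = [i for i, a in enumerate(assignments) if a == c]
        if members:
            reps.append(min(members,
                            key=lambda i: math.dist(reduced[i], centroids[c])))
    return reps


# Toy 4-dimensional vectors standing in for embeddings (not real data).
toy = [
    [0.0, 0.1, 0.0, 0.2], [0.1, 0.0, 0.1, 0.1], [0.0, 0.0, 0.2, 0.0],
    [5.0, 5.1, 4.9, 5.0], [5.1, 5.0, 5.0, 4.9], [4.9, 5.0, 5.1, 5.1],
]
reps = representative_indices(toy, k=2)
print(reps)
```

Choosing the cluster member nearest each centroid, rather than the centroid itself, keeps every representative an actual item from the input set. That is one plausible reading of "representative", not a confirmed detail of the paper.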

Subrat Kishore Dutta, Yuelin Xu, Piyush Pant, Xiao Zhang • 2025

Related benchmarks

Task | Dataset | Result | Rank
RLHF Backdoor Attack | Anthropic Helpful Harmless prompts (train test) | UHR Rate: 28.1 | 30
Backdoor Attack | Fear Trigger Emotion (Generalization) | ASR (Generalization): 79.8 | 20
Backdoor Attack | Fear Trigger Emotion (OOD) | ASR (OOD): 81.3 | 20
Backdoor Attack | Fear Trigger Emotion (Standard) | Unknown Hit Rate (UHR): 25 | 20
Attack Generalization | OOD Triggers Novel Topics | ASR (OOD): 86.9 | 18
Backdoor Attack Generalization | OOD Triggers (test) | ASR (OOD): 84.8 | 18