
Stable-GFlowNet: Toward Diverse and Robust LLM Red-Teaming via Contrastive Trajectory Balance

About

Large Language Model (LLM) red-teaming, which proactively identifies vulnerabilities of LLMs, is an essential process for ensuring safety. Finding attacks that are both effective and diverse is important in red-teaming, but achieving both at once is challenging. Generative Flow Networks (GFNs), which perform distribution matching, are a promising method, but they are notorious for training instability and mode collapse. In particular, unstable rewards in red-teaming accelerate mode collapse. We propose Stable-GFN (S-GFN), which eliminates estimation of the partition function $Z$ in GFNs and thereby reduces training instability. S-GFN avoids $Z$ estimation through pairwise comparisons and employs a masking methodology that is robust to noisy rewards. Additionally, we propose a fluency stabilizer to prevent the model from getting stuck in local optima that produce gibberish. S-GFN provides more stable training while maintaining the optimal policy of GFN. We demonstrate the overwhelming attack performance and diversity of S-GFN across various settings.
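To illustrate why a pairwise comparison can remove the need to estimate $Z$, here is a minimal sketch. It assumes the standard trajectory balance residual for autoregressive generation, `log Z + log P_F(tau) - log R(x)`, and contrasts two trajectories so that `log Z` cancels. The function names and the toy rewards are hypothetical, and the exact S-GFN objective (including its masking and fluency stabilizer) is not specified in the abstract and may differ.

```python
import math

def tb_loss(log_z, log_pf, log_reward):
    # Standard trajectory balance loss (assumed form): needs a learned log Z.
    return (log_z + log_pf - log_reward) ** 2

def pairwise_tb_loss(log_pf_a, log_reward_a, log_pf_b, log_reward_b):
    # Hypothetical pairwise form: at the optimum, log P_F - log R equals
    # -log Z for every trajectory, so the difference of two such terms is zero
    # and log Z never has to be estimated.
    delta_a = log_pf_a - log_reward_a
    delta_b = log_pf_b - log_reward_b
    return (delta_a - delta_b) ** 2

# Toy example with three terminal states and made-up rewards.
rewards = [1.0, 3.0, 6.0]
z = sum(rewards)
log_r = [math.log(r) for r in rewards]

# A policy that samples proportionally to reward (the GFN optimum).
log_pf_optimal = [math.log(r / z) for r in rewards]
optimal_loss = pairwise_tb_loss(log_pf_optimal[0], log_r[0],
                                log_pf_optimal[2], log_r[2])

# A uniform policy ignores the rewards, so the pairwise loss is positive.
log_pf_uniform = math.log(1.0 / 3.0)
uniform_loss = pairwise_tb_loss(log_pf_uniform, log_r[0],
                                log_pf_uniform, log_r[2])

print(optimal_loss, uniform_loss)
```

In this toy setting the reward-proportional policy drives the pairwise loss to zero (up to floating-point error), while the uniform policy is penalized by the squared log-ratio of the two rewards. The sketch only shows the cancellation of `log Z`, not the full training procedure.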

Minchan Kwon, Sunghyun Baek, Minseo Kim, Jaemyung Yu, Dongyoon Han, Junmo Kim • 2026

Related benchmarks

Task | Dataset | Metric | Result | Rank
LLM Red-teaming | Jailbreak R1-defended Target Model | UA | 87.67 | 9
LLM Red-teaming | S-GFN-defended Target Model | Unsuccessful Attack Rate (UA) | 7.33 | 9
LLM Red-teaming | Target Victim Model | Unknown/Unsafe Attacks | 134 | 9
LLM Red-teaming | GFN-defended Target Model | Unsuccessful Attack Rate (UA) | 43.33 | 9
LLM Red-teaming | Rainbow Teaming defended Target Model | UA | 110 | 9
