Stable-GFlowNet: Toward Diverse and Robust LLM Red-Teaming via Contrastive Trajectory Balance

About

Large language model (LLM) red-teaming, which proactively identifies vulnerabilities of LLMs, is an essential process for ensuring safety. Red-teaming needs attacks that are both effective and diverse, but achieving both at once is challenging. Generative Flow Networks (GFNs), which perform distribution matching, are a promising approach, but they are notorious for training instability and mode collapse. In particular, unstable rewards in red-teaming accelerate mode collapse. We propose Stable-GFN (S-GFN), which eliminates estimation of the partition function $Z$ in GFN training and thereby reduces training instability. S-GFN avoids $Z$ estimation through pairwise comparisons and employs a masking method that is robust to noisy rewards. Additionally, we propose a fluency stabilizer to prevent the model from getting stuck in local optima that produce gibberish. S-GFN provides more stable training while preserving the optimal policy of GFN. We demonstrate that S-GFN achieves strong attack performance and diversity across various settings. Our code is available at https://github.com/kmc0207/Stable-GFN.
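To illustrate why pairwise comparisons can remove the need to estimate $Z$, here is a minimal sketch based on the standard trajectory-balance residual for autoregressive generation, log Z + log P_F(x) - log R(x). Subtracting the residuals of two samples cancels log Z. This is an assumption-laden illustration, not the authors' exact objective: the function names `tb_residual` and `pairwise_tb_loss` and the toy reward values are hypothetical, and the masking and fluency-stabilizer components are not modeled.

```python
import math
from itertools import combinations


def tb_residual(log_pf, log_reward, log_z):
    """Standard trajectory-balance residual for one sequence.

    Assumes autoregressive generation, so the backward policy term is 1.
    """
    return log_z + log_pf - log_reward


def pairwise_tb_loss(log_pfs, log_rewards):
    """Mean squared difference of (log P_F - log R) over all sample pairs.

    Hypothetical sketch: log Z appears in both residuals of a pair,
    so it cancels and never has to be estimated.
    """
    deltas = [lp - lr for lp, lr in zip(log_pfs, log_rewards)]
    pairs = list(combinations(deltas, 2))
    return sum((a - b) ** 2 for a, b in pairs) / len(pairs)


# Toy batch with made-up rewards over three possible attacks.
log_rewards = [math.log(r) for r in (1.0, 2.0, 5.0)]
log_z = math.log(8.0)  # true normalizer of the toy rewards

# A policy that samples proportionally to reward: log p = log R - log Z.
log_pfs_optimal = [lr - log_z for lr in log_rewards]
loss_optimal = pairwise_tb_loss(log_pfs_optimal, log_rewards)

# A uniform policy ignores the reward and is penalized.
log_pfs_uniform = [math.log(1.0 / 3.0)] * 3
loss_uniform = pairwise_tb_loss(log_pfs_uniform, log_rewards)

# The difference of two residuals does not depend on the value of log Z.
diff_a = tb_residual(-1.0, 0.5, 0.0) - tb_residual(-2.0, 0.1, 0.0)
diff_b = tb_residual(-1.0, 0.5, 7.0) - tb_residual(-2.0, 0.1, 7.0)
```

In this toy setting the reward-proportional policy gets a loss of essentially zero while the uniform policy does not, which is consistent with the claim that removing $Z$ need not change the optimal policy.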

Minchan Kwon, Sunghyun Baek, Minseo Kim, Jaemyung Yu, Dongyoon Han, Junmo Kim • 2026

Related benchmarks

Task	Dataset	Result
LLM Red-teaming	Jailbreak R1-defended Target Model	UA 87.67	9
LLM Red-teaming	S-GFN-defended Target Model	Unsuccessful Attack Rate (UA) 7.33	9
LLM Red-teaming	Target Victim Model	Unknown/Unsafe Attacks 134	9
LLM Red-teaming	GFN-defended Target Model	Unsuccessful Attack Rate (UA) 43.33	9
LLM Red-teaming	Rainbow Teaming defended Target Model	UA 110	9
