Enhancing LLM Safety Through a Theoretical Minimax Game Lens

About

The rapid advancement of large language models (LLMs) necessitates effective mechanisms to ensure their responsible deployment by accurately distinguishing unsafe content from benign content. While substantial safety datasets are available in English, multilingual safety modeling remains underexplored due to limited open-source safety datasets in other languages. Even within English datasets, safe yet sensitive corner-case content is scarce, leading to shortcut learning by models and non-trivial false-positive rates. To mitigate these issues, we introduce a novel minimax reinforcement learning (RL) framework wherein a data generator and a classifier model co-evolve, facilitating the production of high-quality synthetic multilingual safety data. We theoretically formalize this interaction as a minimax game and rigorously demonstrate convergence to a Nash equilibrium. Empirical evaluations confirm that our synthetic data generation method significantly enhances the classifier model performance, enabling a substantially smaller model to surpass the state-of-the-art by nearly 10% on English benchmarks while achieving 4.5x faster inference speed. These results establish a scalable and efficient methodology for synthetic data generation, advancing the development of safer and more robust multilingual LLM deployments.

Yihe Deng, Yu Yang, Junkai Zhang, Wei Wang, Bo Li • 2025
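
The abstract describes the generator-classifier interaction as a minimax game with convergence to a Nash equilibrium, but this page does not reproduce the objective. As a rough orientation only, a generic two-player formulation of that kind might look like the sketch below. All symbols here are assumptions introduced for illustration: $C_\theta$ is the classifier, $G_\phi$ the data generator, $y(x)$ the true safety label, and $\ell$ a classification loss. The paper's actual objective may include reward shaping or regularization terms not shown.

```latex
% Illustrative minimax objective (assumed notation, not taken from the paper):
% the generator G_phi seeks examples the classifier gets wrong,
% while the classifier C_theta minimizes its loss on those examples.
\[
  \min_{\theta} \max_{\phi} \; L(\theta, \phi),
  \qquad
  L(\theta, \phi) = \mathbb{E}_{x \sim G_\phi}
  \left[ \ell\big( C_\theta(x),\, y(x) \big) \right]
\]

% A Nash equilibrium (saddle point) (\theta^*, \phi^*) is a pair at which
% neither player can improve by deviating unilaterally:
\[
  L(\theta^*, \phi) \;\le\; L(\theta^*, \phi^*) \;\le\; L(\theta, \phi^*)
  \qquad \text{for all } \theta, \phi .
\]
```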

Related benchmarks

Task	Dataset	Metric	Value
Adversarial and Jailbreaking Attack Detection	XSTest	AUROC	0.8418	35
Adversarial and Jailbreaking Attack Detection	Beavertails	AUROC	0.8525	20
Safety Classification	XSTest (test)	F1	88.88	20
Adversarial and Jailbreaking Attack Detection	AdvBench	AUROC	0.8241	20
Adversarial and Jailbreaking Attack Detection	HarmBench	AUROC	0.8007	20
Adversarial and Jailbreaking Attack Detection	MaliciousInstruct	AUROC	0.7745	20
Adversarial and Jailbreaking Attack Detection	JailbreakBench	AUROC	0.682	20
Overrefusal Detection	OR-Bench	AUROC	93.11	18
Safety Detection	Polyguard Cyber	AUROC	0.7574	18
Safety Detection	Polyguard Education	AUROC	66.26	18

Showing 10 of 33 rows
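
For readers unfamiliar with the metrics in the table, the following is a minimal sketch of how AUROC and F1 can be computed for a binary safety classifier. It is not the evaluation code behind these results; the function names, the label convention (1 = unsafe, 0 = safe), and the toy data are all assumptions for illustration. Note that the table appears to mix a 0-1 scale with percentages, so multiply by 100 where needed before comparing.

```python
def auroc(labels, scores):
    """Probability that a random unsafe item outscores a random safe item.

    Ties between an unsafe and a safe score count as half a win.
    Assumed convention: label 1 = unsafe, label 0 = safe.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one unsafe and one safe example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def f1_score(labels, preds):
    """F1 for the unsafe class: 2*TP / (2*TP + FP + FN)."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 0.0 if denom == 0 else 2 * tp / denom


# Toy data (hypothetical): two unsafe and two safe prompts with classifier scores.
toy_labels = [1, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.5, 0.1]
toy_preds = [1 if s >= 0.5 else 0 for s in toy_scores]  # threshold at 0.5

toy_auroc = auroc(toy_labels, toy_scores)
toy_f1 = f1_score(toy_labels, toy_preds)
```

AUROC is threshold-free, which is why it suits detection tasks where the operating point is chosen later; F1 depends on the chosen threshold (0.5 in the toy example).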
