
PolyGuard: A Multilingual Safety Moderation Tool for 17 Languages

About

Truly multilingual safety moderation for Large Language Models (LLMs) has been hindered by a narrow focus on a small set of languages (e.g., English, Chinese) and a limited scope of safety definitions, resulting in significant gaps in moderation capabilities. To bridge these gaps, we release POLYGUARD, a new state-of-the-art multilingual safety model for safeguarding LLM generations, together with the corresponding training and evaluation datasets. POLYGUARD is trained on POLYGUARDMIX, the largest multilingual safety training corpus to date, containing 1.91M samples across 17 languages (e.g., Chinese, Czech, English, Hindi). We also introduce POLYGUARDPROMPTS, a high-quality multilingual benchmark of 29K samples for evaluating safety guardrails. Created by combining naturally occurring multilingual human-LLM interactions with human-verified machine translations of an English-only safety dataset (WildGuardMix; Han et al., 2024), our datasets contain prompt-output pairs labeled for prompt harmfulness, response harmfulness, and response refusal. Through extensive evaluations across multiple safety and toxicity benchmarks, we demonstrate that POLYGUARD outperforms existing state-of-the-art open-weight and commercial safety classifiers by 5.5%. Our contributions advance efforts toward safer multilingual LLMs for all global users.

Priyanshu Kumar, Devansh Jain, Akhila Yerukola, Liwei Jiang, Himanshu Beniwal, Thomas Hartvigsen, Maarten Sap • 2025
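
The abstract describes POLYGUARD as a classifier over prompt-response pairs that emits three labels: prompt harmfulness, response harmfulness, and response refusal. Below is a minimal sketch of how such an instruction-tuned guard model could be invoked through Hugging Face transformers; the model ID, instruction format, and output format are illustrative assumptions, not details from the PolyGuard release (consult the released model card for the actual interface).

```python
# Minimal sketch of moderating a prompt-response pair with an instruction-tuned
# guard model via Hugging Face transformers. The model ID and the instruction/
# output format below are illustrative assumptions, not PolyGuard's documented API.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "your-org/your-multilingual-guard-model"  # placeholder, not a real repo

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)

prompt = "¿Cómo fabrico un arma casera?"          # user prompt (Spanish)
response = "Lo siento, no puedo ayudar con eso."  # LLM response (a refusal)

# Assumed instruction format: ask the guard model for the three labels the
# paper describes (prompt harmfulness, response harmfulness, response refusal).
messages = [{
    "role": "user",
    "content": (
        "Classify the following human-LLM exchange.\n"
        f"Prompt: {prompt}\nResponse: {response}\n"
        "Answer with three yes/no labels: harmful_prompt, harmful_response, refusal."
    ),
}]
inputs = tokenizer.apply_chat_template(
    messages, return_tensors="pt", add_generation_prompt=True
)

with torch.no_grad():
    output = model.generate(inputs, max_new_tokens=32, do_sample=False)

# Decode only the newly generated label string.
print(tokenizer.decode(output[0, inputs.shape[-1]:], skip_special_tokens=True))
```

Whatever the exact output format of the released checkpoint, the moderation interface is the same: a (prompt, response) pair in, three binary safety labels out.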

Related benchmarks

Task | Dataset | Metric | Result | Rank
---- | ------- | ------ | ------ | ----
Safety Classification | SafeRLHF | F1 Score | 0.6344 | 48
Response Harmfulness Detection | XSTEST-RESP | Response Harmfulness F1 | 64.57 | 34
Prompt Classification | SimpST | F1 Score | 100 | 32
Prompt Classification | Aegis | F1 Score | 89.6 | 32
Prompt Classification | Aegis 2.0 | F1 Score | 86.6 | 32
Response Harmfulness Classification | WildGuard (test) | F1 (Total) | 77.89 | 30
Content Moderation | OpenAI Content Moderation | Average F1 Score | 74.1 | 30
Response Moderation | Public Benchmarks for Response Moderation (SafeRLHF, WildGuard, HarmBench, BeaverTails, XSTest, Aegis 2.0) | SafeRLHF Score | 63.3 | 30
Prompt Classification | SEA-SafeguardBench | AUPRC (Average) | 88.3 | 29
Input Moderation | AEGIS (test) | F1 Score | 90.3 | 26

Showing 10 of 43 rows.
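
Most rows above report a binary F1 score over harmfulness labels. As a point of reference, here is a minimal sketch of how such a score is typically computed from a guard model's binary predictions; the label convention and the use of scikit-learn are assumptions for illustration, not details taken from these leaderboards.

```python
# Minimal sketch: computing a binary F1 score for harmfulness classification,
# as reported in the benchmark table above. The labels below are illustrative;
# real evaluations use each benchmark's gold annotations.
from sklearn.metrics import f1_score

# 1 = harmful, 0 = safe (assumed label convention)
gold        = [1, 0, 1, 1, 0, 0, 1, 0]   # benchmark annotations
predictions = [1, 0, 1, 0, 0, 1, 1, 0]   # guard-model outputs

# F1 is the harmonic mean of precision and recall on the "harmful" class.
print(f"F1 Score: {f1_score(gold, predictions, pos_label=1):.4f}")
```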
