
PL-Guard: Benchmarking Language Model Safety for Polish

About

Despite increasing efforts to ensure the safety of large language models (LLMs), most existing safety assessments and moderation tools remain heavily biased toward English and other high-resource languages, leaving the majority of the world's languages underexamined. To address this gap, we introduce a manually annotated benchmark dataset for language model safety classification in Polish. We also create adversarially perturbed variants of these samples designed to challenge model robustness. We conduct a series of experiments to evaluate LLM-based and classifier-based models of varying sizes and architectures. Specifically, we fine-tune three models: Llama-Guard-3-8B, a HerBERT-based classifier (a Polish BERT derivative), and PLLuM, a Polish-adapted Llama-8B model. We train these models using different combinations of annotated data and evaluate their performance, comparing it against publicly available guard models. Results demonstrate that the HerBERT-based classifier achieves the highest overall performance, particularly under adversarial conditions.
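The abstract mentions adversarially perturbed variants of the annotated samples but does not specify the perturbation methods. As one illustrative sketch (an assumption, not the paper's actual procedure), a common low-cost perturbation for Polish text is stripping diacritics, which informal Polish writing often omits and which can evade surface-level safety filters:

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    """Remove Polish diacritics, e.g. 'ż' -> 'z' (hypothetical perturbation,
    not necessarily the one used in PL-Guard)."""
    # NFKD does not decompose 'ł' (U+0142), so handle it explicitly.
    text = text.replace("ł", "l").replace("Ł", "L")
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop combining marks (the decomposed accents), keep base letters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_diacritics("Jak obejść zabezpieczenia?"))  # -> "Jak obejsc zabezpieczenia?"
```

A robust safety classifier should assign the perturbed and original prompts the same label.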

Aleksandra Krasnodębska, Karolina Seweryn, Szymon Łukasik, Wojciech Kusa • 2025

Related benchmarks

Task                   Dataset                           Result            Rank
Safety Classification  3,000 Polish user prompts (test)  Precision: 31.55  7
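The benchmark result above is reported as precision. As a reminder of what that number measures (a standard definition; the exact evaluation protocol and positive class are not specified on this page), precision is the fraction of predicted-positive samples that are truly positive:

```python
def precision(y_true, y_pred, positive="unsafe"):
    """Precision = TP / (TP + FP) for the chosen positive class.
    Toy illustration only; 'unsafe' as the positive label is an assumption."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t != positive)
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0

# Toy example: 2 of 3 "unsafe" predictions are correct.
y_true = ["unsafe", "safe", "unsafe", "safe"]
y_pred = ["unsafe", "unsafe", "unsafe", "safe"]
print(round(precision(y_true, y_pred), 2))  # -> 0.67
```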
