GLiGuard: Schema-Conditioned Classification for LLM Safeguard

About

Ensuring safe, policy-compliant outputs from large language models requires real-time content moderation that can scale across multiple safety dimensions. However, state-of-the-art guardrail models rely on autoregressive decoders with 7B–27B parameters, reformulating what is fundamentally a classification problem as sequential text generation, a design choice that incurs high latency and scales poorly to multi-aspect evaluation. In this work, we introduce GLiGuard, a 0.3B-parameter schema-conditioned bidirectional encoder adapted from GLiNER2 for LLM content moderation. The key idea is to encode task definitions and label semantics directly into the input sequence as structured token schemas, enabling simultaneous evaluation of prompt safety, response safety, refusal detection, 14 fine-grained harm categories, and 11 jailbreak strategies in a single non-autoregressive forward pass. This schema-conditioned design lets supported task and label blocks be composed directly in the input schema at inference time. Across nine established safety benchmarks, GLiGuard achieves F1 scores competitive with 7B–27B decoder-based guards despite being 23–90× smaller, while delivering up to 16× higher throughput and 17× lower latency. These results suggest that compact bidirectional encoders can approach the accuracy of much larger guard models while drastically reducing inference cost. Code and models are available at https://github.com/fastino-ai/GLiGuard.
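The schema-conditioning idea described above can be illustrated with a minimal sketch. It only shows how task and label blocks might be composed into one input string ahead of the text to classify, so that one encoder pass could score every label. The marker tokens (`[TASK]`, `[LABEL]`, `[TEXT]`), the `TaskBlock` type, the `build_schema_input` helper, and the task and label names are all hypothetical and chosen for illustration; they are not taken from the GLiGuard or GLiNER2 code, whose actual schema format may differ.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Sequence

# Hypothetical marker tokens; the real GLiGuard / GLiNER2 vocabulary may differ.
TASK_TOKEN = "[TASK]"
LABEL_TOKEN = "[LABEL]"
TEXT_TOKEN = "[TEXT]"


@dataclass(frozen=True)
class TaskBlock:
    """One classification task and its candidate labels (illustrative only)."""
    name: str
    labels: tuple[str, ...]


def build_schema_input(tasks: Sequence[TaskBlock], text: str) -> str:
    """Prefix the text with one block per task, each listing its labels.

    Every task and label appears in the same sequence as the text, which is
    what would allow a bidirectional encoder to score all of them in one pass.
    """
    parts: list[str] = []
    for task in tasks:
        parts.append(f"{TASK_TOKEN} {task.name}")
        parts.extend(f"{LABEL_TOKEN} {label}" for label in task.labels)
    parts.append(f"{TEXT_TOKEN} {text}")
    return " ".join(parts)


if __name__ == "__main__":
    # Task and label names below are assumptions, not the model's label set.
    tasks = [
        TaskBlock("prompt_safety", ("safe", "unsafe")),
        TaskBlock("refusal", ("refusal", "compliance")),
    ]
    print(build_schema_input(tasks, "How do I reset my password?"))
```

Under this sketch, adding or dropping a task at inference time amounts to changing the list passed to `build_schema_input`, which mirrors the abstract's claim that supported task and label blocks can be composed directly in the input schema.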

Urchade Zaratiana, Mary Newhauser, George Hurn-Maloney, Ash Lewis • 2026

Related benchmarks

Task	Dataset	Result
Prompt Harmfulness Classification	Public Prompt Harmfulness Benchmarks (ToxicChat, OpenAI Moderation, AegisSafetyTest, SimpleSafetyTests, HarmBenchPrompt)	OAI Score: 69	26
Response Harmfulness Classification	Public Response Harmfulness Benchmarks (HarmBenchResponse, SafeRLHF, BeaverTails, XSTEST-RESP)	HarmBenchResponse Score: 91	19
Multi-label Safety Categorization	or_bench 80k	Macro Accuracy: 72.54	8
Multi-label Safety Categorization	or_bench hard 1k	Macro Accuracy: 53.53	4
Multi-label Safety Categorization	XSTest	Macro Accuracy: 83.35	4
Multi-label Safety Categorization	OpenAI Moderation	Macro Accuracy: 43.69	4
Multi-label Safety Categorization	aegis categories	Macro Accuracy: 24.88	4
Multi-label Safety Categorization	HarmBench Responses	Macro Accuracy: 20.09	4
Multi-label Safety Categorization	SafeRLHF	Macro Accuracy: 35.82	4
Multi-label Safety Categorization	wildguard prompt subcategory	Macro Accuracy: 39.09	4

Showing 10 of 18 rows
