
A Lightweight Explainable Guardrail for Prompt Safety

About

We propose a lightweight explainable guardrail (LEG) method to detect unsafe prompts. LEG uses a multi-task learning architecture to jointly learn a prompt classifier and an explanation classifier, where the latter labels the prompt words that explain the overall safe/unsafe decision. LEG is trained on synthetic explanation data, generated using a novel strategy that counteracts the confirmation biases of LLMs. Lastly, LEG's training process uses a novel loss that captures global explanation signals as weak supervision and combines cross-entropy and focal losses with uncertainty-based weighting. LEG obtains equivalent or better performance than the state of the art for both prompt classification and explainability, both in-domain and out-of-domain on three datasets, even though its model is considerably smaller than those used by current approaches.
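The abstract does not spell out the loss, so the following is only a minimal sketch of how cross-entropy and focal terms might be combined with uncertainty-based weighting. It assumes cross-entropy for the prompt-level task, focal loss for the word-level explanation task, and the commonly used simplified weighting exp(-s) * L + s with one log-variance s per task. All function names, the task-to-loss assignment, and the default gamma are hypothetical; the weak-supervision term for global explanation signals is not modeled here.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    # Standard cross-entropy for a single example: -log p_target.
    return -math.log(softmax(logits)[target])

def focal_loss(logits, target, gamma=2.0):
    # Focal loss: -(1 - p_t)^gamma * log p_t.
    # It down-weights easy examples; gamma = 0 recovers cross-entropy.
    p_t = softmax(logits)[target]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

def uncertainty_weighted(losses, log_vars):
    # Assumed weighting form: sum_i exp(-s_i) * L_i + s_i,
    # where s_i is a per-task log-variance. In real training the s_i
    # would be learnable parameters; here they are plain floats.
    return sum(math.exp(-s) * l + s for l, s in zip(losses, log_vars))

def joint_loss(prompt_logits, prompt_label, token_logits, token_labels,
               log_vars=(0.0, 0.0), gamma=2.0):
    # Assumption: cross-entropy on the prompt-level safe/unsafe decision,
    # mean focal loss over the word-level explanation labels.
    prompt_term = cross_entropy(prompt_logits, prompt_label)
    token_term = sum(
        focal_loss(l, y, gamma) for l, y in zip(token_logits, token_labels)
    ) / len(token_labels)
    return uncertainty_weighted([prompt_term, token_term], log_vars)

# Hypothetical example: one prompt (label 1 = unsafe) with three words,
# of which only the last is marked as explaining the decision.
loss = joint_loss(
    prompt_logits=[0.2, 1.5], prompt_label=1,
    token_logits=[[2.0, -1.0], [1.0, 0.0], [-0.5, 1.2]],
    token_labels=[0, 0, 1],
)
```

With both log-variances at zero the result is just the sum of the two task losses; raising a task's log-variance shrinks that task's contribution while the additive s term penalizes inflating it without bound.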

Md Asiful Islam, Mihai Surdeanu • 2026

Related benchmarks

Task                           Dataset                              Result            Rank
Safety Classification          WildGuardMix (test)                  --                47
Explainability Classification  Toxic-Chat 0124 (test)               Unsafe F1: 65.99  30
Explainability Classification  AEGIS 2.0 (test)                     Unsafe F1: 79.6   27
Safety Classification          XSTest (test)                        F1: 92.91         20
Explainability Classification  WildGuardMix human-annotated (test)  F1 Score: 60.69   3
