A Lightweight Explainable Guardrail for Prompt Safety
About
We propose a lightweight explainable guardrail (LEG) method to detect unsafe prompts. LEG uses a multi-task learning architecture to jointly learn a prompt classifier and an explanation classifier, where the latter labels the prompt words that explain the overall safe/unsafe decision. LEG is trained on synthetic explanation data, which is generated using a novel strategy that counteracts the confirmation biases of LLMs. Lastly, LEG's training process uses a novel loss that captures global explanation signals as weak supervision and combines cross-entropy and focal losses with uncertainty-based weighting. LEG obtains performance equivalent to or better than the state of the art for both prompt classification and explainability, both in-domain and out-of-domain on three datasets, despite being considerably smaller than current approaches.
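The loss combination described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a common form of uncertainty-based weighting, in which each task loss L_i is scaled by exp(-s_i) and regularized by a learnable log-variance s_i, and it assumes cross-entropy for the prompt-level task and focal loss for the word-level task. The abstract does not specify either choice, and the global explanation signal is omitted here. All function names and probability values are hypothetical.

```python
import math


def cross_entropy(p_true: float) -> float:
    """Cross-entropy for one example, given the probability of the true class."""
    return -math.log(p_true)


def focal_loss(p_true: float, gamma: float = 2.0) -> float:
    """Focal loss: cross-entropy down-weighted by (1 - p_true) ** gamma,
    so confidently correct predictions contribute less."""
    return (1.0 - p_true) ** gamma * cross_entropy(p_true)


def uncertainty_weighted(losses, log_vars):
    """Assumed weighting: sum_i exp(-s_i) * L_i + s_i,
    where s_i is a (normally learnable) log-variance for task i."""
    return sum(math.exp(-s) * loss + s for loss, s in zip(losses, log_vars))


# Hypothetical model outputs: probability assigned to the correct label.
prompt_p = 0.9                # prompt-level safe/unsafe decision
word_ps = [0.8, 0.6, 0.95]    # word-level explanation labels

prompt_loss = cross_entropy(prompt_p)
word_loss = sum(focal_loss(p) for p in word_ps) / len(word_ps)

# With both log-variances at 0 the total reduces to a plain sum of task losses.
total = uncertainty_weighted([prompt_loss, word_loss], [0.0, 0.0])
```

In a real training loop the log-variances would be trainable parameters updated alongside the model weights, so a task with noisier supervision is automatically down-weighted.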
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Safety Classification | WildGuardMix (test) | -- | 47 |
| Explainability classification | Toxic-Chat 0124 (test) | Unsafe F1: 65.99 | 30 |
| Explainability classification | AEGIS 2.0 (test) | Unsafe F1: 79.6 | 27 |
| Safety Classification | XSTest (test) | F1: 92.91 | 20 |
| Explainability classification | WildGuardMix human-annotated (test) | F1 Score: 60.69 | 3 |