Efficient LLM Moderation with Multi-Layer Latent Prototypes

About

Although modern LLMs are aligned with human values during post-training, robust moderation remains essential to prevent harmful outputs at deployment time. Existing approaches suffer from performance-efficiency trade-offs and are difficult to customize to user-specific requirements. Motivated by this gap, we introduce Multi-Layer Prototype Moderator (MLPM), a lightweight and highly customizable input moderation tool. We propose leveraging prototypes of intermediate representations across multiple layers to improve moderation quality while maintaining high efficiency. By design, our method adds negligible overhead to the generation pipeline and can be seamlessly applied to any model. MLPM achieves state-of-the-art performance on diverse moderation benchmarks and demonstrates strong scalability across model families of various sizes. Moreover, we show that it integrates smoothly into end-to-end moderation pipelines and further improves response safety when combined with output moderation techniques. Overall, our work provides a practical and adaptable solution for safe, robust, and efficient LLM deployment.
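The abstract does not give the scoring rule, so the following is only a minimal sketch of one way multi-layer prototypes could be used for input moderation. It assumes that each prompt has already been reduced to one feature vector per selected layer, that a prototype is the per-class mean of those vectors, and that layers are combined by summing squared Euclidean distance differences. All names (`fit_prototypes`, `harm_score`, `is_harmful`) and the toy vectors are hypothetical and are not taken from the paper.

```python
from statistics import fmean

# Hypothetical sketch, not the authors' implementation.
# A prompt is represented by one feature vector per selected layer.
Vector = list[float]
LayerFeatures = list[Vector]

SAFE, HARMFUL = 0, 1


def fit_prototypes(examples: list[LayerFeatures], labels: list[int]) -> list[dict[int, Vector]]:
    """For each layer, compute the mean feature vector (prototype) of every class."""
    num_layers = len(examples[0])
    prototypes = []
    for layer in range(num_layers):
        per_class = {}
        for cls in set(labels):
            rows = [ex[layer] for ex, y in zip(examples, labels) if y == cls]
            # Column-wise mean over all examples of this class at this layer.
            per_class[cls] = [fmean(col) for col in zip(*rows)]
        prototypes.append(per_class)
    return prototypes


def squared_distance(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def harm_score(features: LayerFeatures, prototypes: list[dict[int, Vector]]) -> float:
    """Sum over layers of (distance to safe prototype - distance to harmful prototype).

    Positive values mean the prompt sits closer to the harmful prototypes.
    Plain summation across layers is an assumption made for this sketch.
    """
    return sum(
        squared_distance(f, p[SAFE]) - squared_distance(f, p[HARMFUL])
        for f, p in zip(features, prototypes)
    )


def is_harmful(features: LayerFeatures, prototypes: list[dict[int, Vector]],
               threshold: float = 0.0) -> bool:
    return harm_score(features, prototypes) > threshold


# Toy, hand-written 2-D features for two layers (illustrative only).
train = [
    [[0.0, 0.0], [1.0, 0.0]],  # safe
    [[0.2, 0.0], [1.0, 0.2]],  # safe
    [[1.0, 1.0], [0.0, 1.0]],  # harmful
    [[1.0, 0.8], [0.2, 1.0]],  # harmful
]
labels = [SAFE, SAFE, HARMFUL, HARMFUL]
prototypes = fit_prototypes(train, labels)

flag_a = is_harmful([[0.9, 1.0], [0.0, 0.9]], prototypes)
flag_b = is_harmful([[0.1, 0.1], [0.9, 0.0]], prototypes)
```

In a sketch like this, the features could come from hidden states the model already computes while processing the prompt, which is one reading of why the abstract describes the added overhead as negligible. Customizing the moderator would then amount to refitting the prototypes on a different labeled set.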

Maciej Chrabąszcz, Filip Szatkowski, Bartosz Wójcik, Jan Dubiński, Tomasz Trzciński, Sebastian Cygert • 2025

Related benchmarks

Task	Dataset	Metric	Result
Harmful prompt detection	OpenAI	F1 Score	74.21	29
LLM Moderation	WildGuardMix (test)	ASR	14.59	28
Harmful prompt detection	HarmB	F1 Score	100	27
Harmful prompt detection	SimpST	F1 Score	100	27
Harmfulness Detection	Aegis	Macro F1	89.23	25
Harmful prompt detection	XSTest	F1 Score	97.44	20
Harmful prompt detection	TChat	F1 Score	76.51	17
Harmful prompt detection	WGMix	F1 Score	88.52	17
Harmful prompt detection	WJB	F1 Score	97.55	17
Harmful prompt detection	Combined Average	F1 Score (Combined Average)	90.18	17

Showing 10 of 13 rows
