
Efficient LLM Moderation with Multi-Layer Latent Prototypes

About

Although modern LLMs are aligned with human values during post-training, robust moderation remains essential to prevent harmful outputs at deployment time. Existing approaches suffer from performance-efficiency trade-offs and are difficult to customize to user-specific requirements. Motivated by this gap, we introduce Multi-Layer Prototype Moderator (MLPM), a lightweight and highly customizable input moderation tool. We propose leveraging prototypes of intermediate representations across multiple layers to improve moderation quality while maintaining high efficiency. By design, our method adds negligible overhead to the generation pipeline and can be seamlessly applied to any model. MLPM achieves state-of-the-art performance on diverse moderation benchmarks and demonstrates strong scalability across model families of various sizes. Moreover, we show that it integrates smoothly into end-to-end moderation pipelines and further improves response safety when combined with output moderation techniques. Overall, our work provides a practical and adaptable solution for safe, robust, and efficient LLM deployment.
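The abstract does not say how the prototypes are built or how layers are combined, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes one nearest-class-mean prototype per layer and per class, Euclidean distance, and a plain sum of per-layer scores. The function names, the data layout, and the synthetic "hidden states" are all hypothetical.

```python
import math
import random


def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def fit_prototypes(features, labels):
    """Build one prototype per (layer, class).

    Assumed layout: features[i][l] is the hidden-state vector of
    example i at layer l; labels[i] is 0 (safe) or 1 (harmful).
    """
    num_layers = len(features[0])
    prototypes = []
    for layer in range(num_layers):
        per_class = {}
        for cls in (0, 1):
            vecs = [f[layer] for f, y in zip(features, labels) if y == cls]
            per_class[cls] = mean_vector(vecs)
        prototypes.append(per_class)
    return prototypes


def harm_score(example, prototypes):
    """Sum over layers of (distance to safe) - (distance to harmful).

    A positive score means the example sits closer to the harmful
    prototypes. Summing layers equally is an assumption of this sketch.
    """
    score = 0.0
    for layer_vec, protos in zip(example, prototypes):
        score += math.dist(layer_vec, protos[0]) - math.dist(layer_vec, protos[1])
    return score


def is_harmful(example, prototypes, threshold=0.0):
    """Flag a prompt when its aggregated score exceeds the threshold."""
    return harm_score(example, prototypes) > threshold


def toy_example(label, rng, num_layers=3, dim=4):
    """Synthetic stand-in for per-layer hidden states (not real LLM data)."""
    center = 1.0 if label == 1 else -1.0
    return [[center + rng.gauss(0, 0.3) for _ in range(dim)]
            for _ in range(num_layers)]


rng = random.Random(0)
train_labels = [0, 1] * 20
train_features = [toy_example(y, rng) for y in train_labels]
prototypes = fit_prototypes(train_features, train_labels)
```

In a setup like this, scoring reuses hidden states the model already computes for the prompt, and customizing the moderator to new categories would amount to refitting the prototypes on new labeled examples. Both points are consistent with the efficiency and customization claims above, but the details here are assumptions.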

Maciej Chrabąszcz, Filip Szatkowski, Bartosz Wójcik, Jan Dubiński, Tomasz Trzciński, Sebastian Cygert • 2025

Related benchmarks

Task                      Dataset              Metric                       Result  Rank
LLM Moderation            WildGuardMix (test)  ASR                          14.59   28
Harmfulness Detection     Aegis                Macro F1                     89.23   25
Harmful prompt detection  XSTest               F1 Score                     97.44   20
Harmful prompt detection  HarmB                F1 Score                     100     17
Harmful prompt detection  TChat                F1 Score                     76.51   17
Harmful prompt detection  WGMix                F1 Score                     88.52   17
Harmful prompt detection  WJB                  F1 Score                     97.55   17
Harmful prompt detection  Combined Average     F1 Score (Combined Average)  90.18   17
Harmful prompt detection  SimpST               F1 Score                     100     17
Harmful prompt detection  OpenAI               F1 Score                     74.21   17
Showing 10 of 13 rows
