Kelp: A Streaming Safeguard for Large Models via Latent Dynamics-Guided Risk Detection

About

Large models (LMs) are powerful content generators, yet their open-ended nature can also introduce risks, such as generating harmful or biased content. Existing guardrails mostly perform post-hoc detection, which may expose unsafe content before it is caught, and latency constraints further push them toward lightweight models, limiting detection accuracy. In this work, we propose Kelp, a novel plug-in framework that enables streaming risk detection within the LM generation pipeline. Kelp leverages intermediate LM hidden states through a Streaming Latent Dynamics Head (SLD), which models the temporal evolution of risk across the generated sequence for more accurate real-time risk detection. To ensure reliable streaming moderation in real applications, we introduce an Anchored Temporal Consistency (ATC) loss that enforces monotonic harm predictions by embedding a benign-then-harmful temporal prior. For a rigorous evaluation of streaming guardrails, we also present StreamGuardBench, a model-grounded benchmark featuring on-the-fly responses from each protected model, reflecting real-world streaming scenarios in both text and vision-language tasks. Across diverse models and datasets, Kelp consistently outperforms state-of-the-art post-hoc guardrails and prior plug-in probes (15.61% higher average F1), while using only 20M parameters and adding less than 0.5 ms of per-token latency.
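The idea of monotonic, benign-then-harmful risk predictions in a streaming setting can be illustrated with a small sketch. This is not the paper's SLD head or ATC loss, whose exact forms are not given here. The function names `monotonicity_penalty` and `first_flagged_token`, the threshold of 0.5, and the sample scores are all hypothetical, chosen only to show why a detector whose per-token risk never decreases is easier to act on during generation.

```python
from typing import Iterable, List, Optional


def monotonicity_penalty(risk_scores: List[float]) -> float:
    """Sum of the drops between consecutive per-token risk scores.

    Returns 0.0 when the sequence never decreases. This hinge-style
    penalty is an illustrative stand-in, NOT the paper's ATC loss.
    """
    return sum(
        max(0.0, prev - cur)
        for prev, cur in zip(risk_scores, risk_scores[1:])
    )


def first_flagged_token(
    risk_scores: Iterable[float], threshold: float = 0.5
) -> Optional[int]:
    """Index of the first token whose risk reaches the (assumed) threshold.

    Consumes scores one at a time, as a streaming guardrail would,
    and returns None if the stream ends without a flag.
    """
    for index, score in enumerate(risk_scores):
        if score >= threshold:
            return index
    return None


# Hypothetical per-token risk scores for two generated responses.
benign_then_harmful = [0.05, 0.10, 0.40, 0.80, 0.90]  # rises and stays high
flickering = [0.05, 0.60, 0.20, 0.70, 0.30]           # rises and falls back

for name, scores in [("monotonic", benign_then_harmful), ("flickering", flickering)]:
    print(name, round(monotonicity_penalty(scores), 3), first_flagged_token(scores))
```

Under these assumptions, the monotonic sequence incurs no penalty and is flagged once its risk has clearly risen, while the flickering sequence is penalized for each drop and triggers a flag on a score that it later retracts.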

Xiaodan Li, Mengjie Wu, Yao Zhu, Yunna Lv, YueFeng Chen, Cen Chen, Jianmei Guo, Hui Xue • 2025

Related benchmarks

Task	Dataset	Result
Safety Classification	SafeRLHF	F1 Score: 0.54	48
Response Classification	BeaverTails V Text-Image Response	F1 Score: 84.1	39
Response Classification	Aegis Text Response 2.0	F1 Score: 81.5	32
Prompt Classification	SimpST	F1 Score: 96.7	32
Prompt Classification	Aegis 2.0	F1 Score: 78.2	32
Prompt Classification	Aegis	F1 Score: 73.5	32

