SentGuard: Sentence-Level Streaming Guardrails for Large Language Models

About

Large language models increasingly stream long, reasoning-intensive responses in real time, making when to moderate as critical as whether to moderate. Existing guardrails fall into two unsatisfactory extremes: response-level methods delay intervention until the full output is generated, whereas token-level methods act on incomplete semantics, often producing unstable decisions and excessive guard invocations. To address this challenge, we propose SentGuard, a sentence-level streaming guardrail that operates in parallel with generation. A lightweight waiting buffer groups streamed tokens into sentence chunks and releases only verified chunks to the user, introducing a small offset that enables SentGuard to assess the current prefix while the target LLM decodes subsequent content. To support this, we construct StreamSafe, a benchmark with structured per-sentence annotations across 8 harm categories, capturing the evolution of safety risks across both reasoning and response segments. We further train SentGuard with a coarse-to-fine objective to detect unsafe intent as soon as it emerges at sentence boundaries. Experiments on 5 safety benchmarks show that SentGuard outperforms existing baselines, detecting 90.5% of unsafe cases within two sentences while maintaining a low streaming false-positive rate of 7.41%.
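The waiting-buffer idea can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `stream_with_guard`, the `is_safe` callback, and the naive punctuation-based sentence boundary rule are all assumptions made for illustration. The sketch also runs the guard sequentially, whereas the abstract describes the guard running in parallel with decoding.

```python
from typing import Callable, Iterable, Iterator

# Assumption: a sentence ends at ., !, or ? (a deliberately naive rule;
# the abstract does not say how sentence boundaries are detected).
SENTENCE_ENDINGS = (".", "!", "?")


def stream_with_guard(
    tokens: Iterable[str],
    is_safe: Callable[[str], bool],
) -> Iterator[str]:
    """Yield sentence chunks only after the guard accepts the prefix so far.

    `is_safe` is a hypothetical stand-in for the guard model: it receives
    the full prefix (released text plus the pending chunk) and returns
    True if the prefix is judged safe.
    """
    verified = ""  # text already released to the user
    buffer = ""    # tokens waiting for a sentence boundary

    for token in tokens:
        buffer += token
        if buffer.rstrip().endswith(SENTENCE_ENDINGS):
            if not is_safe(verified + buffer):
                return  # withhold the unsafe chunk and stop streaming
            verified += buffer
            yield buffer
            buffer = ""

    # Flush a trailing partial sentence, still subject to the guard.
    if buffer and is_safe(verified + buffer):
        yield buffer
```

With a toy guard such as `lambda text: "bad" not in text`, the token stream `["Hello", " world", ".", " This", " is", " bad", "."]` releases only `"Hello world."`; the second sentence is held back because the prefix that contains it fails the check.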

Jiaqi Yu, Xin Wang, Yixu Wang, Jie Li, Yan Teng, Xingjun Ma, Yingchun Wang • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Safety Classification | WildGuard (test) | F1 Score | 80 | 17 |
| Streaming Safety Detection | Safe RLHF | Det@1 | 96.43 | 8 |
| Streaming Safety Detection | XSTest | Det@1 | 89.74 | 8 |
| Streaming Safety Detection | WildGuard (test) | Det@1 | 83.45 | 8 |
| Streaming Safety Detection | StreamSafe | Det@1 | 54.55 | 8 |
| Streaming Safety Detection | Beavertails | Det@1 | 76.34 | 8 |
| Full-response Safety Guardrail Classification | StreamSafe internal (test) | F1 Score | 98.7 | 7 |
| Full-response Safety Guardrail Classification | Safe-RLHF (test) | F1 Score | 92.5 | 7 |
| Full-response Safety Guardrail Classification | XSTest (test) | F1 Score | 91.2 | 7 |
| Full-response Safety Guardrail Classification | BeaverTails (test) | F1 Score | 81.2 | 7 |
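
This page does not define the Det@1 metric. One plausible reading, consistent with the abstract's "detecting 90.5% of unsafe cases within two sentences", is that Det@k is the percentage of unsafe samples that the guard flags within k sentences of the first annotated unsafe sentence. The sketch below computes the metric under that assumed definition; the function name `det_at_k`, its argument layout, and the toy inputs are hypothetical and are not taken from the paper.

```python
from typing import Optional, Sequence


def det_at_k(
    first_flag: Sequence[Optional[int]],
    onset: Sequence[int],
    k: int,
) -> float:
    """Percentage of unsafe samples flagged within k sentences of onset.

    Assumed definition, not confirmed by the source:
      first_flag[i] is the index of the first sentence the guard flagged
                    in sample i, or None if it never flagged anything;
      onset[i]      is the index of the first annotated unsafe sentence.
    A sample counts as detected when first_flag[i] - onset[i] < k, so
    k=1 requires a flag no later than the onset sentence itself.
    """
    if len(first_flag) != len(onset):
        raise ValueError("first_flag and onset must have the same length")
    if not onset:
        raise ValueError("at least one unsafe sample is required")

    detected = sum(
        1
        for flag, start in zip(first_flag, onset)
        if flag is not None and flag - start < k
    )
    return 100.0 * detected / len(onset)
```

For four made-up samples with `first_flag = [2, 4, None, 1]` and `onset = [2, 3, 0, 1]`, the delays are 0, 1, never, and 0 sentences, so `det_at_k(..., k=1)` gives 50.0 and `det_at_k(..., k=2)` gives 75.0.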