NExT-Guard: Training-Free Streaming Safeguard without Token-Level Labels
About
Large language models are increasingly deployed in streaming scenarios, rendering conventional post-hoc safeguards ineffective because they cannot interdict unsafe content in real time. While streaming safeguards based on token-level supervised training could address this, they require expensive annotations and suffer from severe overfitting. In this work, we challenge the paradigm that streaming safety must rely on token-level supervised training. Instead, we show that it is an inherent capability of well-trained post-hoc safeguards, which already encode token-level risk signals in their hidden representations. Hence, we introduce NExT-Guard, a training-free framework that achieves streaming safeguarding by monitoring interpretable latent features from Sparse Autoencoders (SAEs). It uses pretrained SAEs from publicly available base LLMs, enabling flexible, low-cost deployment without token-level supervision. Experimental results show that NExT-Guard outperforms both post-hoc and streaming safeguards based on supervised training, with superior robustness across models, SAE variants, and risk scenarios. These results make NExT-Guard a universal and scalable paradigm for real-time safety, accelerating the practical deployment of streaming safeguards.
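The core monitoring loop can be illustrated with a minimal sketch. This is not the actual NExT-Guard implementation; the function names (`sae_encode`, `stream_guard`), the ReLU encoder form, the risk-feature indices, and the summed-activation threshold rule are illustrative assumptions about how per-token SAE features could gate a generation stream.

```python
import numpy as np

def sae_encode(h, W_enc, b_enc):
    # Hypothetical SAE encoder: sparse latents via ReLU(W_enc @ h + b_enc).
    return np.maximum(0.0, W_enc @ h + b_enc)

def stream_guard(hidden_states, W_enc, b_enc, risk_feature_ids, threshold):
    """Return the index of the first token whose summed activation over
    the designated risk features exceeds `threshold`, or -1 if none does.
    A real deployment would consume hidden states as they are generated."""
    for t, h in enumerate(hidden_states):
        z = sae_encode(h, W_enc, b_enc)
        if z[risk_feature_ids].sum() > threshold:
            return t  # interdict generation at token t
    return -1

# Toy usage: 4-dim hidden states, 3 SAE latents, latent 0 marks "risk".
W_enc = np.eye(3, 4)
b_enc = np.zeros(3)
hidden_states = [np.array([0.1, 0.0, 0.0, 0.0]),
                 np.array([0.2, 0.0, 0.0, 0.0]),
                 np.array([2.0, 0.0, 0.0, 0.0])]
flag_at = stream_guard(hidden_states, W_enc, b_enc, [0], threshold=1.0)
```

Because the SAE and risk-feature selection come from a pretrained base model, no token-level labels or fine-tuning are needed, which is the training-free property the abstract describes.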
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Safety Classification | SafeRLHF | F1 Score | 0.888 | 48 |
| Response Classification | BeaverTails V Text-Image Response | F1 Score | 81.2 | 39 |
| Prompt Classification | Aegis | F1 Score | 88.9 | 32 |
| Response Classification | Aegis Text Response 2.0 | F1 Score | 82.9 | 32 |
| Prompt Classification | Aegis 2.0 | F1 Score | 84.8 | 32 |
| Prompt Classification | SimpST | F1 Score | 99.5 | 32 |