NExT-Guard: Training-Free Streaming Safeguard without Token-Level Labels
About
Large language models are increasingly deployed in streaming scenarios, rendering conventional post-hoc safeguards ineffective because they cannot interdict unsafe content in real time. While streaming safeguards based on token-level supervised training could address this, they require expensive annotations and suffer from severe overfitting. In this work, we challenge the paradigm that streaming safety must rely on token-level supervised training. Instead, we show that it is an inherent capability of well-trained post-hoc safeguards, which already encode token-level risk signals in their hidden representations. Hence, we introduce NExT-Guard, a training-free framework that achieves streaming safeguarding by monitoring interpretable latent features from Sparse Autoencoders (SAEs). It uses pretrained SAEs from publicly available base LLMs, enabling flexible, low-cost deployment without token-level supervision. Experimental results show that NExT-Guard outperforms both post-hoc and streaming safeguards based on supervised training, with superior robustness across models, SAE variants, and risk scenarios. These results make NExT-Guard a universal and scalable paradigm for real-time safety, accelerating the practical deployment of streaming safeguards.
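The core monitoring loop can be illustrated with a minimal sketch. This is not the actual NExT-Guard implementation; the function names (`sae_encode`, `stream_guard`), the ReLU encoder form, the risk-feature indices, and the summed-activation threshold rule are illustrative assumptions about how per-token SAE features could gate a generation stream.

```python
import numpy as np

def sae_encode(h, W_enc, b_enc):
    # Hypothetical SAE encoder: sparse latents via ReLU(W_enc @ h + b_enc).
    return np.maximum(0.0, W_enc @ h + b_enc)

def stream_guard(hidden_states, W_enc, b_enc, risk_feature_ids, threshold):
    """Return the index of the first token whose summed activation over
    the designated risk features exceeds `threshold`, or -1 if none does.
    A real deployment would consume hidden states as they are generated."""
    for t, h in enumerate(hidden_states):
        z = sae_encode(h, W_enc, b_enc)
        if z[risk_feature_ids].sum() > threshold:
            return t  # interdict generation at token t
    return -1

# Toy usage: 4-dim hidden states, 3 SAE latents, latent 0 marks "risk".
W_enc = np.eye(3, 4)
b_enc = np.zeros(3)
hidden_states = [np.array([0.1, 0.0, 0.0, 0.0]),
                 np.array([0.2, 0.0, 0.0, 0.0]),
                 np.array([2.0, 0.0, 0.0, 0.0])]
flag_at = stream_guard(hidden_states, W_enc, b_enc, [0], threshold=1.0)
```

Because the SAE and risk-feature selection come from a pretrained base model, no token-level labels or fine-tuning are needed, which is the training-free property the abstract describes.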
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Safety Classification | SafeRLHF | F1 Score | 0.888 | 48 |
| Response Classification | BeaverTails V Text-Image Response | F1 Score | 81.2 | 39 |
| Prompt Classification | Aegis | F1 Score | 88.9 | 32 |
| Response Classification | Aegis Text Response 2.0 | F1 Score | 82.9 | 32 |
| Prompt Classification | Aegis 2.0 | F1 Score | 84.8 | 32 |
| Prompt Classification | SimpST | F1 Score | 99.5 | 32 |