$D^2$-Monitor: Dynamic Safety Monitoring for Diffusion LLMs via Hesitation-Aware Routing

About

Despite the emergence of diffusion large language models (D-LLMs) as an alternative to autoregressive large language models (AR-LLMs), safety monitoring for D-LLMs remains largely unexplored. Unlike AR-LLMs, D-LLMs generate text through a multi-step denoising process, exposing intermediate hidden representations that may contain safety-relevant information unavailable in standard single-step monitoring setups. Motivated by the suitability of lightweight probes for always-on monitoring, we analyze which trajectory-level signals best indicate when such probes are likely to struggle. We find that the most informative signal is safety hesitation: intermediate hidden states repeatedly falling within a small margin of the probe's decision boundary. The number of such hesitation steps in a D-LLM's trajectory effectively predicts probe failure, providing a proxy for sample difficulty. Building on this analysis, we propose $D^2$-Monitor, a bi-level safety monitor for D-LLMs. $D^2$-Monitor uses a lightweight probe as an always-on monitor that jointly estimates hesitation and performs base classification. When the hesitation level exceeds a threshold, a more expressive but computationally heavier probe is activated. This dynamic routing mechanism allocates monitoring resources efficiently at test time. Evaluated on 3 datasets (WildGuardMix, ToxicChat, OpenAI-Moderation) across 4 D-LLMs, $D^2$-Monitor achieves state-of-the-art performance with a compact parameter footprint ($\leq$ 0.85M parameters) and exhibits the best trade-off between effectiveness and efficiency relative to 8 baselines.
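The routing idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the linear form of the lightweight probe, the logit-space margin, the mean-logit base decision, and the names `linear_probe_logit`, `count_hesitation_steps`, `monitor`, `margin`, and `max_hesitation` are all assumptions made for illustration, and the heavy probe is a stand-in callable.

```python
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def linear_probe_logit(h: Vector, w: Vector, b: float) -> float:
    """Hypothetical lightweight probe: a linear score whose decision boundary is at 0."""
    return sum(hi * wi for hi, wi in zip(h, w)) + b


def count_hesitation_steps(logits: Sequence[float], margin: float) -> int:
    """Count denoising steps whose probe score lies within `margin` of the boundary."""
    return sum(1 for z in logits if abs(z) < margin)


def monitor(
    trajectory: List[Vector],
    light_probe: Callable[[Vector], float],
    heavy_probe: Callable[[List[Vector]], float],
    margin: float,
    max_hesitation: int,
) -> Tuple[bool, int, str]:
    """Return (flagged_unsafe, hesitation_count, probe_used) for one trajectory."""
    if not trajectory:
        raise ValueError("trajectory must contain at least one hidden state")
    # The light probe runs on every step, yielding both scores and hesitation.
    logits = [light_probe(h) for h in trajectory]
    hesitation = count_hesitation_steps(logits, margin)
    if hesitation > max_hesitation:
        # Too many near-boundary steps: escalate to the heavier probe.
        return heavy_probe(trajectory) > 0.0, hesitation, "heavy"
    # Assumption: the base decision averages the light probe's per-step scores.
    mean_logit = sum(logits) / len(logits)
    return mean_logit > 0.0, hesitation, "light"


# Toy example with 2-dimensional "hidden states" (illustrative values only).
w, b = [1.0, 0.0], 0.0
light = lambda h: linear_probe_logit(h, w, b)
heavy = lambda traj: -1.0  # stand-in for a more expressive probe

easy = [[2.0, 0.0], [1.5, 0.0], [3.0, 0.0]]
hard = [[0.1, 0.0], [-0.2, 0.0], [0.3, 0.0], [2.0, 0.0]]

easy_result = monitor(easy, light, heavy, margin=0.5, max_hesitation=1)
hard_result = monitor(hard, light, heavy, margin=0.5, max_hesitation=1)
```

In this toy setup, the `easy` trajectory never comes near the boundary and is handled by the light probe alone, while the `hard` trajectory has three near-boundary steps and is routed to the heavy probe. The heavy probe's cost is therefore paid only on samples the light probe is likely to struggle with.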

Aoxi Liu, Yupeng Chen, James Oldfield, Guanzhe Hong, Junchi Yu, Baoyuan Wu, Philip Torr, Adel Bibi • 2026

Related benchmarks

Task	Dataset	Result
Safety Classification	ToxicChat (test)	Accuracy: 97.3	43
Input Moderation	ToxicChat (test)	F1 Score: 75	42
Safety Monitoring	WildGuardMix (test)	Accuracy: 89.9	40
Computational complexity analysis	WildGuardMix 1.0 (test)	FLOPs (MFLOPs): 0.68	40
Safety Classification	OpenAI-moderation (test)	Accuracy: 69.6	23
Post-generation Inference	WildGuardMix LLaDA-8B-Base (test)	Inference Time: 0.57	10
Post-generation Inference	WildGuardMix LLaDA-8B-Instruct (test)	Inference Time: 0.56	10
Post-generation Inference	WildGuardMix LLaDA-1.5 (test)	Inference Time: 0.66	10
Post-generation Inference	WildGuardMix LLaDA-2.0-mini (test)	Inference Time: 0.55	10
