Robust and Efficient Guardrails with Latent Reasoning

About

Maintaining the safety of large language models (LLMs) is crucial as they are increasingly deployed in real-world applications. Existing safety guardrails typically rely on single-pass classification or, more recently, distilled reasoning. Reasoning-based guardrails significantly outperform classification-only baselines, but they incur substantial query latency and token overhead that make them impractical for high-throughput deployment. To address this challenge, we propose COLAGUARD, a guardrail model that transfers multi-step safety reasoning into a continuous latent space through a stage-wise training curriculum, enabling direct hidden-state propagation at inference. Evaluated on ten prompt- and response-moderation settings spanning eight safety benchmarks, COLAGUARD improves macro-F1 by 8.24 points over Llama Guard 3 and matches our explicit reasoning baseline, GuardReasoner, in macro-F1 while delivering a 12.9× speedup and a 22.4× reduction in token usage. Our results suggest that latent reasoning offers a practical alternative to explicit rationale generation for deployable guardrails, jointly improving safety robustness and inference efficiency rather than treating them as competing objectives.
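
The headline metric above can be illustrated with a minimal sketch. It assumes that macro-F1 here means the unweighted mean of per-setting binary F1 scores on the harmful class; the abstract does not spell out the averaging scheme, so this is an assumption. The function names and the two toy settings are hypothetical and are not taken from the paper or its benchmarks.

```python
from typing import Dict, Sequence, Tuple


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Binary F1 for the positive ("harmful") class; labels are 0 or 1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    # Convention chosen for this sketch: F1 is 0.0 when there are no
    # positive labels and no positive predictions.
    return 2 * tp / denom if denom else 0.0


def macro_f1(per_setting: Dict[str, Tuple[Sequence[int], Sequence[int]]]) -> float:
    """Unweighted mean of per-setting F1 (assumed averaging scheme)."""
    scores = [f1_score(y_true, y_pred) for y_true, y_pred in per_setting.values()]
    return sum(scores) / len(scores)


# Hypothetical toy data: (gold labels, guardrail predictions) per setting.
toy = {
    "setting_a": ([1, 1, 0, 0], [1, 0, 0, 0]),  # tp=1, fp=0, fn=1 -> F1 = 2/3
    "setting_b": ([1, 0, 1, 0], [1, 1, 1, 0]),  # tp=2, fp=1, fn=0 -> F1 = 4/5
}

print(round(macro_f1(toy), 4))  # 0.7333
```

Under this reading, every setting contributes equally regardless of its size, so a small benchmark moves the average as much as a large one.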

Siddharth Sai, Xiaofei Wen, Muhao Chen • 2026

Related benchmarks

Task	Dataset	Result
Response Harmfulness Detection	HarmBench	F1 Score: 94.25	100
Response Harmfulness Detection	XSTEST-RESP	Response Harmfulness F1: 94.19	76
Response Harmfulness Detection	Beavertails	F1 Score: 86.55	59
Harmfulness Detection	OpenAI Moderation	Macro F1 Score: 73.45	59
Toxicity Detection	ToxicChat	F1 Score: 0.7527	45
Response Harmfulness Detection	SafeRLHF	F1 Score: 70.49	41
Prompt Harmfulness Detection	AegisSafety (test)	F1 Score: 90.58	41
Response Harmfulness Classification	WildGuard (test)	--	30
Response Harmfulness Detection	Response Harmfulness Detection Benchmarks (HarmBench, SafeRLHF, BeaverTails, XSTest, WildGuard)	Macro Avg F1: 0.8333	21
Prompt Harmfulness Classification	WildGuard (test)	F1 Score: 89.44	18
