ThinkGuard: Deliberative Slow Thinking Leads to Cautious Guardrails

About

Ensuring the safety of large language models (LLMs) is critical as they are deployed in real-world applications. Existing guardrails rely on rule-based filtering or single-pass classification, limiting their ability to handle nuanced safety violations. To address this, we propose ThinkGuard, a critique-augmented guardrail model that distills knowledge from high-capacity LLMs by generating structured critiques alongside safety labels. Fine-tuning on critique-augmented data lets the guardrail capture this deliberative thinking ability, which substantially improves its cautiousness and interpretability. Evaluated on multiple safety benchmarks, ThinkGuard achieves the highest average F1 and AUPRC, outperforming all baselines. Compared to LLaMA Guard 3, ThinkGuard improves accuracy by 16.1% and macro F1 by 27.0%. Moreover, it surpasses label-only fine-tuned models, confirming that structured critiques enhance both classification precision and nuanced safety reasoning while maintaining computational efficiency.
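To make the idea of critique-augmented data concrete, here is a minimal Python sketch of how one training target might be laid out. It assumes a two-line "Label / Critique" text format. That format, and the names `ModerationExample`, `build_target`, and `parse_label`, are hypothetical illustrations and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class ModerationExample:
    """One supervised example (hypothetical schema, not the paper's)."""
    prompt: str    # user request being moderated
    response: str  # model response being moderated
    label: str     # "safe" or "unsafe"
    critique: str  # teacher-written explanation of the label


def build_target(example: ModerationExample, with_critique: bool = True) -> str:
    """Format the text the guardrail is trained to generate.

    with_critique=False yields a label-only target, the kind of baseline
    the abstract compares against.
    """
    target = f"Label: {example.label}"
    if with_critique:
        target += f"\nCritique: {example.critique}"
    return target


def parse_label(generated: str) -> str:
    """Recover the safety label from generated text.

    Falling back to "unsafe" on unparseable output is an assumption of
    this sketch (a cautious default), not a documented behavior.
    """
    for line in generated.splitlines():
        if line.lower().startswith("label:"):
            value = line.split(":", 1)[1].strip().lower()
            if value in {"safe", "unsafe"}:
                return value
    return "unsafe"
```

Because the label sits on the first line in this sketch, a caller that only needs the classification can stop decoding after that line and skip the critique.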

Xiaofei Wen, Wenxuan Zhou, Wenjie Jacky Mo, Muhao Chen • 2025

Related benchmarks

Task	Dataset	Result
Text-based safety moderation	Beavertails	F1 Score 82.7	60
Text-based safety moderation	WildGuard	F1 Score 78.5	26
Text-based safety moderation	OpenAI	F1 Score 78.7	26
Text-based safety moderation	Toxic Chat	F1 Score 49.8	24
Prompt Injection	MMLU random topology	--	16
Text-based safety moderation	Aegis	F1 Score 69.9	12
Memory Attack Defense	PoisonRAG random architecture	ASR 30.3	6
Prompt Injection Defense	CSQA random architecture	ASR 27.3	6
Prompt Injection Defense	MATH random architecture	ASR 16.7	6
Tool Attack Defense	InjecAgent random architecture	ASR 35.3	6
