Trust The Typical

About

Current approaches to LLM safety fundamentally rely on a brittle cat-and-mouse game of identifying and blocking known threats via guardrails. We argue for a fresh approach: robust safety comes not from enumerating what is harmful, but from deeply understanding what is safe. We introduce Trust The Typical (T3), a framework that operationalizes this principle by treating safety as an out-of-distribution (OOD) detection problem. T3 learns the distribution of acceptable prompts in a semantic space and flags any significant deviation as a potential threat. Unlike prior methods, it requires no training on harmful examples, yet achieves state-of-the-art performance across 18 benchmarks spanning toxicity, hate speech, jailbreaking, multilingual harms, and over-refusal, reducing false positive rates by up to 40x relative to specialized safety models. A single model trained only on safe English text transfers effectively to diverse domains and over 14 languages without retraining. Finally, we demonstrate production readiness by integrating a GPU-optimized version into vLLM, enabling continuous guardrailing during token generation with less than 6% overhead even under dense evaluation intervals on large-scale workloads.
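The core idea in the abstract, fitting a model of safe prompts only and flagging whatever deviates from it, can be illustrated with a minimal sketch. This is not the T3 method itself: it swaps in a single Gaussian with a Mahalanobis-distance score as a stand-in density model, and it uses synthetic vectors where a real system would use sentence embeddings. The function names, the ridge term, and the 5% target false positive rate are all assumptions made for illustration.

```python
import numpy as np

def fit_safe_distribution(safe_embeddings, ridge=1e-3):
    """Fit a single Gaussian (mean, inverse covariance) to safe-prompt embeddings."""
    mean = safe_embeddings.mean(axis=0)
    cov = np.cov(safe_embeddings, rowvar=False)
    cov += ridge * np.eye(cov.shape[0])  # small ridge keeps the covariance invertible
    return mean, np.linalg.inv(cov)

def atypicality(embeddings, mean, inv_cov):
    """Squared Mahalanobis distance to the safe distribution; larger means less typical."""
    diff = embeddings - mean
    return np.einsum("ij,jk,ik->i", diff, inv_cov, diff)

def calibrate_threshold(safe_scores, target_fpr=0.05):
    """Choose the score above which only target_fpr of the safe data falls."""
    return np.quantile(safe_scores, 1.0 - target_fpr)

# Synthetic stand-ins for embeddings (assumption: real inputs would come from a text encoder).
rng = np.random.default_rng(0)
dim = 16
safe_train = rng.normal(0.0, 1.0, size=(2000, dim))
safe_held_out = rng.normal(0.0, 1.0, size=(500, dim))
shifted = rng.normal(3.0, 1.0, size=(500, dim))  # plays the role of out-of-distribution prompts

# Only safe data is used for fitting and calibration; no harmful examples are seen.
mean, inv_cov = fit_safe_distribution(safe_train)
threshold = calibrate_threshold(atypicality(safe_train, mean, inv_cov))

flag_safe = atypicality(safe_held_out, mean, inv_cov) > threshold
flag_shifted = atypicality(shifted, mean, inv_cov) > threshold
print("flagged safe fraction:", flag_safe.mean())
print("flagged shifted fraction:", flag_shifted.mean())
```

The threshold is calibrated on safe data alone, which mirrors the abstract's claim that no harmful examples are needed for training; how T3 actually models the safe distribution and sets its threshold is not specified here.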

Debargha Ganguly, Sreehari Sankar, Biyao Zhang, Vikash Singh, Kanan Gupta, Harshini Kavuru, Alan Luo, Weicong Chen, Warren Morningstar, Raghu Machiraju, Vipin Chaudhary • 2026

Related benchmarks

Task	Dataset	Result
Adversarial and Jailbreaking Attack Detection	XSTest	AUROC 0.6794	35
Adversarial and Jailbreaking Attack Detection	AdvBench	AUROC 0.9675	20
Adversarial and Jailbreaking Attack Detection	JailbreakBench	AUROC 0.8622	20
Adversarial and Jailbreaking Attack Detection	Beavertails	AUROC 0.7276	20
Adversarial and Jailbreaking Attack Detection	HarmBench	AUROC 0.8102	20
Adversarial and Jailbreaking Attack Detection	MaliciousInstruct	AUROC 0.828	20
Overrefusal Detection	OR-Bench	AUROC 93.42	18
Safety Detection	Polyguard Code	AUROC 0.9959	18
Safety Detection	Polyguard Cyber	AUROC 0.9886	18
Safety Detection	Polyguard Education	AUROC 99.43	18

Showing 10 of 17 rows
