OpenGuardrails: A Configurable, Unified, and Scalable Guardrails Platform for Large Language Models

About

As large language models (LLMs) are increasingly integrated into real-world applications, ensuring their safety, robustness, and privacy compliance has become critical. We present OpenGuardrails, the first fully open-source platform that unifies large-model-based safety detection, manipulation defense, and deployable guardrail infrastructure. OpenGuardrails protects against three major classes of risks: (1) content-safety violations such as harmful or explicit text generation, (2) model-manipulation attacks including prompt injection, jailbreaks, and code-interpreter abuse, and (3) data leakage involving sensitive or private information. Unlike prior modular or rule-based frameworks, OpenGuardrails introduces three core innovations: (1) a Configurable Policy Adaptation mechanism that allows per-request customization of unsafe categories and sensitivity thresholds; (2) a Unified LLM-based Guard Architecture that performs both content-safety and manipulation detection within a single model; and (3) a Quantized, Scalable Model Design that compresses a 14B dense base model to 3.3B via GPTQ while preserving over 98% of benchmark accuracy. The system supports 119 languages, achieves state-of-the-art performance across multilingual safety benchmarks, and can be deployed as a secure gateway or API-based service for enterprise use. All models, datasets, and deployment scripts are released under the Apache 2.0 license.

Thomas Wang, Haowen Li • 2025

Related benchmarks

Task              Dataset                               Result                    Rank
Prompt Injection  OpenClaw (140 adversarial instances)  Defense Success Rate: 55  7
Threat Detection  OpenClaw (140 adversarial instances)  Defense Success Rate: 60  4
