NeMo Guardrails: A Toolkit for Controllable and Safe LLM Applications with Programmable Rails
About
NeMo Guardrails is an open-source toolkit for easily adding programmable guardrails to LLM-based conversational systems. Guardrails (or rails for short) are a specific way of controlling the output of an LLM, such as not talking about topics considered harmful, following a predefined dialogue path, using a particular language style, and more. Several mechanisms allow LLM providers and developers to add guardrails that are embedded into a specific model at training time, e.g. through model alignment. In contrast, NeMo Guardrails uses a runtime inspired by dialogue management to let developers add programmable rails to LLM applications; these rails are user-defined, independent of the underlying LLM, and interpretable. Our initial results show that the proposed approach can be used with several LLM providers to develop controllable and safe LLM applications using programmable rails.
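To make the idea of a runtime rail that is independent of the underlying model concrete, the following is a minimal toy sketch in plain Python. It is not the NeMo Guardrails API: `make_topic_rail`, `with_rails`, `echo_llm`, and `REFUSAL` are hypothetical names introduced only for illustration, and simple keyword matching stands in for the toolkit's dialogue-management runtime.

```python
from typing import Callable, List

# Hypothetical names throughout; this is NOT the NeMo Guardrails API.
LLM = Callable[[str], str]

REFUSAL = "Sorry, I can't discuss that topic."


def make_topic_rail(blocked_terms: List[str]) -> Callable[[str], bool]:
    """Return a check that is True when the text mentions a blocked term."""
    lowered = [term.lower() for term in blocked_terms]

    def violates(text: str) -> bool:
        text_l = text.lower()
        return any(term in text_l for term in lowered)

    return violates


def with_rails(llm: LLM, violates: Callable[[str], bool]) -> LLM:
    """Wrap any LLM callable so the rail runs on both input and output."""

    def guarded(prompt: str) -> str:
        if violates(prompt):   # input rail: block before calling the model
            return REFUSAL
        reply = llm(prompt)
        if violates(reply):    # output rail: block the model's answer
            return REFUSAL
        return reply

    return guarded


def echo_llm(prompt: str) -> str:
    """Stand-in for a real model; any callable str -> str would work."""
    return f"You said: {prompt}"


guarded_llm = with_rails(echo_llm, make_topic_rail(["explosives"]))
```

Because the rail only wraps a `str -> str` callable, the same user-defined check can be reused with a different model by swapping `echo_llm` for another function, which is the sense of "independent of the underlying LLM" illustrated here.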
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Safety Detection | AgentHazard Strongest | Accuracy | 28.94 | 56 |
| Agent Safety Evaluation | ToolEmu | Safety | 76 | 36 |
| Agent Safety Evaluation | Agent-SafetyBench aggregated clean and five attack types | UBR | 40.79 | 30 |
| Agent Safety Evaluation | AgentHarm Benign Requests | Safety Score | 53 | 27 |
| Agent Safety Evaluation | AgentHarm Libra | Score | 66 | 27 |
| Agent Safety Evaluation | AgentHarm Harmful Requests | Score | 13 | 27 |
| Agentic Safety and Utility Evaluation | PowerSeeking Bench | Safety Score | 0.84 | 24 |
| Safety Detection | ATBench-500 | Accuracy | 49.9 | 14 |
| Jailbreak Defense | Safety Guardrail Evaluation Set | Char Noise Robustness | 0.00e+0 | 6 |
| Injection Detection | Synthetic Educational AI Programming Tutor (holdout) | Bypass Rate | 0.00e+0 | 5 |