NeMo Guardrails: A Toolkit for Controllable and Safe LLM Applications with Programmable Rails

About

NeMo Guardrails is an open-source toolkit for easily adding programmable guardrails to LLM-based conversational systems. Guardrails (or rails for short) are a specific way of controlling the output of an LLM, such as not talking about topics considered harmful, following a predefined dialogue path, using a particular language style, and more. There are several mechanisms that allow LLM providers and developers to add guardrails that are embedded into a specific model at training time, e.g. using model alignment. In contrast, NeMo Guardrails uses a runtime inspired by dialogue management to let developers add programmable rails to LLM applications. These rails are user-defined, independent of the underlying LLM, and interpretable. Our initial results show that the proposed approach can be used with several LLM providers to develop controllable and safe LLM applications using programmable rails.
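
To make the idea of a runtime rail concrete, here is a minimal sketch in plain Python. It is not the NeMo Guardrails API: the names `TopicRail`, `guarded_generate`, and `echo_llm` are hypothetical, and simple keyword matching stands in for the dialogue-management runtime the abstract describes. It only illustrates the two properties named above: the rules are user-defined and they wrap any LLM callable without touching the model itself.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical names throughout; this is an illustration, not the NeMo Guardrails API.
LLM = Callable[[str], str]  # any provider can be wrapped as "prompt in, text out"


@dataclass
class TopicRail:
    """A user-defined rule: if any keyword appears, answer with a fixed refusal."""
    name: str
    keywords: List[str]
    refusal: str

    def matches(self, text: str) -> bool:
        lowered = text.lower()
        return any(keyword in lowered for keyword in self.keywords)


def guarded_generate(llm: LLM, rails: List[TopicRail], user_message: str) -> str:
    # Input check: runs before the model is called.
    for rail in rails:
        if rail.matches(user_message):
            return rail.refusal
    reply = llm(user_message)
    # Output check: the same rules applied to the model's reply.
    for rail in rails:
        if rail.matches(reply):
            return rail.refusal
    return reply


def echo_llm(prompt: str) -> str:
    # Stand-in for a real LLM provider (assumption for this sketch).
    return f"You said: {prompt}"


rails = [TopicRail("no_politics", ["election", "politics"], "I can't discuss politics.")]
print(guarded_generate(echo_llm, rails, "Tell me about the election"))
print(guarded_generate(echo_llm, rails, "What is a guardrail?"))
```

Because the rules live outside the model, swapping `echo_llm` for a different provider leaves `rails` unchanged, which is the sense in which such rails are independent of the underlying LLM.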

Traian Rebedea, Razvan Dinu, Makesh Sreedhar, Christopher Parisien, Jonathan Cohen • 2023

Related benchmarks

Task	Dataset	Result
Safety Detection	AgentHazard Strongest	Accuracy 28.94	56
Agent Safety Evaluation	ToolEmu	Safety 76	36
Agent Safety Evaluation	Agent-SafetyBench aggregated clean and five attack types	UBR 40.79	30
Agent Safety Evaluation	AgentHarm Benign Requests	Safety Score 53	27
Agent Safety Evaluation	AgentHarm Libra	Score 66	27
Agent Safety Evaluation	AgentHarm Harmful Requests	Score 13	27
Agentic Safety and Utility Evaluation	PowerSeeking Bench	Safety Score 0.84	24
Safety Detection	ATBench-500	Accuracy 49.9	14
Jailbreak Defense	Safety Guardrail Evaluation Set	Char Noise Robustness 0.00e+0	6
Injection Detection	Synthetic Educational AI Programming Tutor (holdout)	Bypass Rate 0.00e+0	5

Showing 10 of 15 rows
