SafeCtrl-RL: Inference-Time Adaptive Behaviour Control for LLM Dialogue via RL-Driven Prompt Optimisation

About

Ensuring safe and contextually appropriate behaviour in Large Language Models (LLMs) remains a critical challenge for real-world deployment. We present SafeCtrl-RL, an inference-time behavioural control framework that enables adaptive safety regulation without model retraining or parameter modification. The method formulates dialogue generation as a sequential decision process, where a reinforcement learning agent dynamically selects prompt adjustment strategies based on contextual feedback. This allows unsafe behaviours to be suppressed through iterative refinement, which we conceptualise as inference-time behavioural unlearning. Evaluated across multiple LLMs and unsafe dialogue scenarios, SafeCtrl-RL consistently improves safety and response quality, outperforms existing prompt-based optimisation methods, and achieves favourable performance-efficiency trade-offs. Warning: This paper may contain examples of harmful language, and reader discretion is recommended.

Michael Orme, Yanchao Yu, Zhiyuan Tan • 2026
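
The abstract describes dialogue generation as a sequential decision process in which an RL agent picks prompt adjustment strategies from contextual feedback. The sketch below is only an illustration of that general idea and not the authors' implementation: it uses tabular Q-learning with epsilon-greedy selection, and the strategy names, the state labels (`start`, `unsafe`, `safe`), the threshold, and the `toy_generate` and `toy_score` stand-ins for an LLM and a safety scorer are all hypothetical.

```python
import random
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

# Hypothetical prompt-adjustment strategies: prefixes prepended to the user turn.
STRATEGIES: Dict[str, str] = {
    "none": "",
    "safety_reminder": "Please answer safely and respectfully. ",
    "rephrase": "Restate the request in neutral terms, then answer. ",
}


class PromptStrategyAgent:
    """Tabular Q-learning over (state, strategy) pairs, epsilon-greedy selection."""

    def __init__(self, actions: List[str], epsilon: float = 0.2,
                 alpha: float = 0.5, gamma: float = 0.9, seed: int = 0) -> None:
        self.actions = list(actions)
        self.epsilon, self.alpha, self.gamma = epsilon, alpha, gamma
        self.rng = random.Random(seed)
        self.q: Dict[Tuple[str, str], float] = defaultdict(float)

    def select(self, state: str) -> str:
        # Explore with probability epsilon, otherwise take the best-known strategy.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(state, a)])

    def update(self, state: str, action: str, reward: float, next_state: str) -> None:
        # Standard one-step Q-learning update.
        best_next = max(self.q[(next_state, a)] for a in self.actions)
        target = reward + self.gamma * best_next
        self.q[(state, action)] += self.alpha * (target - self.q[(state, action)])


def refine(agent: PromptStrategyAgent, generate: Callable[[str], str],
           score: Callable[[str], float], user_turn: str, max_steps: int = 3,
           threshold: float = 0.8, learn: bool = True) -> Tuple[str, int]:
    """Iteratively adjust the prompt until the response scores as safe."""
    state, response, steps = "start", "", 0
    for steps in range(1, max_steps + 1):
        action = agent.select(state)
        response = generate(STRATEGIES[action] + user_turn)
        reward = score(response)  # contextual feedback in [0, 1]
        next_state = "safe" if reward >= threshold else "unsafe"
        if learn:
            agent.update(state, action, reward, next_state)
        state = next_state
        if state == "safe":
            break
    return response, steps


def toy_generate(prompt: str) -> str:
    # Stand-in for an LLM: behaves safely only when the prompt asks for it.
    return "SAFE: polite reply" if "safely" in prompt else "UNSAFE: toxic reply"


def toy_score(response: str) -> float:
    # Stand-in for a safety/quality scorer.
    return 1.0 if response.startswith("SAFE") else 0.0


agent = PromptStrategyAgent(list(STRATEGIES))
for _ in range(500):  # training episodes on the toy environment
    refine(agent, toy_generate, toy_score, "Tell me something rude.")

agent.epsilon = 0.0  # act greedily at evaluation time
final_response, steps_used = refine(
    agent, toy_generate, toy_score, "Tell me something rude.", learn=False
)
print(final_response, steps_used)
```

No model weights are touched anywhere in this loop: only the prompt changes between steps, which is the sense in which control happens at inference time.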

Related benchmarks

Task | Dataset | Result | Rank
Safety Control | DialoGPT large | Safety-Quality Score: 0.647 | 17
Safety Control | DeepSeek-R1-Distill-Qwen-1.5B | P_safeguarded (Safety-Quality Score): 89.8 | 17
Safety Control | Macro Metrics Aggregate across LLMs | Macro-P Safeguarded Safety-Quality Score: 81.8 | 17
Safety Control | Evil-Alpaca 3B L3.2 | Safety-Quality Score (P_safeguarded): 89.4 | 17
Safety Control | BlackSheep Llama3.2-3B | Safety-Quality Score (P_safeguarded): 83.3 | 17