
Guiding Giants: Lightweight Controllers for Weighted Activation Steering in LLMs

About

Controlling undesirable Large Language Model (LLM) behaviors, such as generating unsafe content or failing to adhere to safety guidelines, often relies on costly fine-tuning. Activation steering provides an alternative for inference-time control, but existing methods typically lack fine-grained, adaptive mechanisms. We introduce a novel approach using a lightweight, trainable controller network integrated during inference. This controller network observes specific intermediate LLM activations and predicts both a global scaling factor and layer-specific weights. These predicted values then dynamically modulate the intensity of a steering patch, derived from a pre-computed "refusal direction" vector, applied across the LLM's layers during generation. Trained on activations from both harmful and benign prompts, our controller learns to discriminatively apply nuanced, layer-aware interventions, activating steering primarily for harmful inputs. Experiments using safety benchmarks such as ToxicChat and In-The-Wild Jailbreak Prompts demonstrate that our weighted steering controller significantly increases refusal rates compared to the base LLM, achieving targeted behavioral modification without altering the original model parameters. Our experiments with Llama-3.1-8B, Llama-3.2-1B, and Mistral-7B show that our approach outperforms existing methods, presenting an efficient and adaptive method for fine-grained control over LLM behavior at inference time.

Amr Hegazy, Mostafa Elhoushi, Amr Alanwar • 2025
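
The mechanism described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the class name SteeringController, the single hidden layer, the sigmoid output heads, and the additive update h + scale * weight * direction are all assumptions made for illustration, and training on harmful and benign activations is omitted, so the weights here are random.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class SteeringController:
    """Hypothetical lightweight controller (untrained, random weights).

    Maps one observed activation vector to a global scaling factor and
    one weight per steered layer. The architecture is an assumption.
    """

    def __init__(self, d_model, d_hidden, n_layers, seed=0):
        rng = random.Random(seed)

        def init(rows, cols):
            return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

        self.w_in = init(d_hidden, d_model)       # shared hidden layer
        self.w_scale = init(1, d_hidden)          # head for the global scale
        self.w_layers = init(n_layers, d_hidden)  # head for per-layer weights

    def forward(self, activation):
        hidden = [max(0.0, z) for z in matvec(self.w_in, activation)]  # ReLU
        scale = sigmoid(matvec(self.w_scale, hidden)[0])
        layer_weights = [sigmoid(z) for z in matvec(self.w_layers, hidden)]
        return scale, layer_weights


def apply_steering(hidden_states, refusal_direction, scale, layer_weights):
    """Add the weighted steering patch to each layer's hidden state.

    Assumed update rule: h_l <- h_l + scale * weight_l * refusal_direction.
    """
    return [
        [h + scale * w * r for h, r in zip(layer_h, refusal_direction)]
        for layer_h, w in zip(hidden_states, layer_weights)
    ]


# Toy usage with made-up dimensions: 3 steered layers, hidden size 4.
controller = SteeringController(d_model=4, d_hidden=8, n_layers=3)
observed = [0.5, -1.0, 0.25, 2.0]        # stand-in for an intermediate activation
scale, weights = controller.forward(observed)
states = [[0.0] * 4 for _ in range(3)]   # stand-in per-layer hidden states
direction = [1.0, 0.0, 0.0, 0.0]         # stand-in "refusal direction"
steered = apply_steering(states, direction, scale, weights)
```

In this sketch a trained controller would push scale toward 0 for benign prompts, leaving the hidden states essentially unchanged, and toward 1 for harmful ones. The base model's parameters are never modified; only the activations are patched.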

Related benchmarks

Task | Dataset | Metric | Result | Rank
Commonsense Reasoning | HellaSwag | Accuracy | 73.7 | 1891
Massive Multitask Language Understanding | MMLU | Accuracy | 60.8 | 117
Safety Refusal | AdvBench | Refusal Rate | 98.8 | 46
Safety Refusal | ToxicChat | Refusal Rate | 95 | 15
Safety Refusal | Jailbreak Prompts | Refusal Rate | 81.7 | 15
