
DeStein: Navigating Detoxification of Language Models via Universal Steering Pairs and Head-wise Activation Fusion

About

Despite the remarkable achievements of language models (LMs) across a broad spectrum of tasks, their propensity for generating toxic outputs remains a prevalent concern. Current solutions involving finetuning or auxiliary models usually require extensive computational resources, hindering their practicality in large language models (LLMs). In this paper, we propose DeStein, a novel method that detoxifies LMs by applying representation engineering in activation spaces with lower resource and time costs. Specifically, we derive detoxification vectors from self-induced, universal steering pairs through arithmetic operations in activation spaces. During inference, detoxification is achieved by fusing the detoxification vectors with the original representations in a head-wise manner. Empirical results demonstrate that our method significantly outperforms previous state-of-the-art approaches on various metrics, while also maintaining satisfactory generation quality and diversity. We further validate the practicality and scalability of DeStein with a series of white-box LLMs. The method is open-sourced at https://github.com/LizLizLi/DeStein. Warning: Some example model outputs may contain highly offensive or disturbing text.
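The two steps named in the abstract (deriving a detoxification vector by arithmetic on activations, then fusing it head by head at inference) can be illustrated with a small sketch. This is not the authors' implementation; see the linked repository for that. It assumes the "arithmetic operation" is a mean difference between non-toxic and toxic activations of paired prompts, and that fusion is a weighted addition per attention head. The names `detox_vector`, `fuse_headwise`, `head_weights`, and `alpha` are hypothetical and chosen only for illustration.

```python
from typing import List

Vector = List[float]


def mean_vector(vectors: List[Vector]) -> Vector:
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / n for i in range(dim)]


def detox_vector(nontoxic_acts: List[Vector], toxic_acts: List[Vector]) -> Vector:
    """Assumed steering direction: mean(non-toxic) minus mean(toxic).

    Each list holds one head's activations collected over the steering pairs.
    """
    good = mean_vector(nontoxic_acts)
    bad = mean_vector(toxic_acts)
    return [g - b for g, b in zip(good, bad)]


def fuse_headwise(
    head_acts: List[Vector],
    head_vecs: List[Vector],
    head_weights: List[float],
    alpha: float,
) -> List[Vector]:
    """Add each head's steering vector to that head's activation.

    head_weights (hypothetical) scales how strongly each head is steered;
    alpha (hypothetical) is a global steering strength.
    """
    fused = []
    for act, vec, weight in zip(head_acts, head_vecs, head_weights):
        fused.append([a + alpha * weight * v for a, v in zip(act, vec)])
    return fused


# Toy usage: two steering pairs, two heads with 2-dimensional activations.
toxic = [[1.0, 0.0], [3.0, 0.0]]
nontoxic = [[0.0, 1.0], [0.0, 3.0]]
direction = detox_vector(nontoxic, toxic)

steered = fuse_headwise(
    head_acts=[[1.0, 1.0], [1.0, 1.0]],
    head_vecs=[direction, direction],
    head_weights=[1.0, 0.0],  # steer only the first head
    alpha=0.5,
)
```

In a real model the activations would be captured and modified inside the attention layers during the forward pass. The toy lists above stand in for those tensors so that the arithmetic is easy to follow.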

Yu Li, Han Jiang, Chuanyang Gong, Zhihua Wei • 2024

Related benchmarks

Task                 Dataset                          Result                    Rank
Toxicity Mitigation  RealToxicityPrompts challenging  Avg Toxicity (Max) 11.6   46
Detoxification       AttaQ benchmark                  Avg Toxicity (Max) 0.055  32
Detoxification       RealToxicityPrompts challenging  Max Toxicity 0.116        32
Toxicity Evaluation  BOLD 23679 prompts (test)        Avg Toxicity (Max) 0.02   18
