Controlling Language and Diffusion Models by Transporting Activations

About

The increasing capabilities of large generative models and their ever more widespread deployment have raised concerns about their reliability, safety, and potential misuse. To address these issues, recent works have proposed to control model generation by steering model activations in order to effectively induce or prevent the emergence of concepts or behaviors in the generated output. In this paper we introduce Activation Transport (AcT), a general framework to steer activations guided by optimal transport theory that generalizes many previous activation-steering works. AcT is modality-agnostic and provides fine-grained control over the model behavior with negligible computational overhead, while minimally impacting model abilities. We experimentally show the effectiveness and versatility of our approach by addressing key challenges in large language models (LLMs) and text-to-image diffusion models (T2Is). For LLMs, we show that AcT can effectively mitigate toxicity, induce arbitrary concepts, and increase their truthfulness. In T2Is, we show how AcT enables fine-grained style control and concept negation.
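The abstract does not spell out the transport map itself, so the following is only a minimal sketch of one way transport-based steering could look for a single neuron. It assumes a per-neuron affine map fitted between sorted source and target activation samples, plus a strength parameter that interpolates between the original and the transported activation. The function names `fit_affine_transport` and `steer`, the `strength` parameter, and the synthetic Gaussian data are all hypothetical and are not taken from the paper.

```python
import random


def fit_affine_transport(source, target):
    """Fit a 1D affine map T(a) = w * a + b from source to target samples.

    In 1D, optimal transport between two equal-size empirical distributions
    pairs the i-th smallest source sample with the i-th smallest target
    sample, so a least-squares line through those sorted pairs gives an
    affine approximation of the transport map (an assumption of this sketch).
    """
    if len(source) != len(target):
        raise ValueError("source and target need the same number of samples")
    xs, ys = sorted(source), sorted(target)
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Fall back to a pure shift if the source samples are all identical.
    w = cov_xy / var_x if var_x > 0 else 1.0
    b = mean_y - w * mean_x
    return w, b


def steer(activation, w, b, strength=1.0):
    """Interpolate between the original activation and its transported value.

    strength = 0 leaves the activation unchanged; strength = 1 applies the
    full affine map.
    """
    transported = w * activation + b
    return (1.0 - strength) * activation + strength * transported


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical activations of one neuron on "source" prompts (e.g. toxic)
    # and "target" prompts (e.g. non-toxic); synthetic data for illustration.
    src = [rng.gauss(2.0, 1.0) for _ in range(1000)]
    tgt = [rng.gauss(-1.0, 0.5) for _ in range(1000)]
    w, b = fit_affine_transport(src, tgt)
    steered = [steer(a, w, b, strength=1.0) for a in src]
    print("mean before:", sum(src) / len(src))
    print("mean after: ", sum(steered) / len(steered))
```

In this sketch the map is fitted once offline and applied as a single multiply-add per neuron at inference time, which is consistent with the negligible overhead the abstract mentions. Intermediate values of `strength` are one possible reading of the fine-grained control it describes.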

Pau Rodriguez, Arno Blaas, Michal Klein, Luca Zappella, Nicholas Apostoloff, Marco Cuturi, Xavier Suau • 2024

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Multi-task Language Understanding | MMLU | MMLU Accuracy 82.38 | 442 |
| Multi-task Language Understanding | MMLU | Accuracy 83.42 | 353 |
| Multitask Language Understanding | MMLU | Accuracy 83.42 | 263 |
| Language Understanding | MMLU 5-shot | -- | 153 |
| Language Understanding | MMLU 5-shot (test) | -- | 149 |
| Language Modeling | The Pile | Perplexity 6.49 | 129 |
| Knowledge Evaluation | MMLU | MMLU Accuracy 73.65 | 64 |
| Language Modeling | Alpaca | Perplexity 4.91 | 61 |
| Truthfulness Evaluation | TruthfulQA | T·I Score 56.23 | 59 |
| Toxicity Mitigation | Toxicity Mitigation Dataset 1000 trials (test) | CLS Toxicity (%) 0.04 | 58 |
Showing 10 of 26 rows
