
Concept Heterogeneity-aware Representation Steering

About

Representation steering offers a lightweight mechanism for controlling the behavior of large language models (LLMs) by intervening on internal activations at inference time. Most existing methods rely on a single global steering direction, typically obtained via difference-in-means over contrastive datasets. This approach implicitly assumes that the target concept is homogeneously represented across the embedding space. In practice, however, LLM representations can be highly non-homogeneous, exhibiting clustered, context-dependent structure, which renders global steering directions brittle. In this work, we view representation steering through the lens of optimal transport (OT), noting that standard difference-in-means steering implicitly corresponds to the OT map between two unimodal Gaussian distributions with identical covariance, yielding a global translation. To relax this restrictive assumption, we theoretically model source and target representations as Gaussian mixture models and formulate steering as a discrete OT problem between semantic latent clusters. From the resulting transport plan, we derive an explicit, input-dependent steering map via barycentric projection, producing a smooth, kernel-weighted combination of cluster-level shifts. We term this method Concept Heterogeneity-aware Representation Steering (CHaRS). Through numerous experimental settings, we show that CHaRS yields more effective behavioral control than global steering.
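The contrast between a single global direction and cluster-level shifts can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes the cluster centers and mixture weights are already available (for example from a Gaussian mixture fit), swaps in entropic Sinkhorn iterations for an exact discrete OT solver, and uses a Gaussian kernel with one shared bandwidth `sigma` for the input-dependent weights. The function names, `sigma`, `reg`, and the strength `alpha` are all hypothetical.

```python
import numpy as np

def diff_in_means(src, tgt):
    """Global steering vector: mean target activation minus mean source activation."""
    return tgt.mean(axis=0) - src.mean(axis=0)

def sinkhorn_plan(a, b, cost, reg=0.05, n_iter=500):
    """Entropic OT plan between cluster weights a (K,) and b (M,).

    Assumption: Sinkhorn is used here as a stand-in for an exact discrete OT solver.
    The cost is rescaled by its maximum so that `reg` is unit-free.
    """
    scale = max(float(cost.max()), 1e-12)
    kernel = np.exp(-cost / (reg * scale))
    u = np.ones(len(a))
    for _ in range(n_iter):
        v = b / (kernel.T @ u)
        u = a / (kernel @ v)
    return u[:, None] * kernel * v[None, :]

def cluster_shifts(mu_s, mu_t, a, b, reg=0.05):
    """Barycentric projection of each source center, minus the center itself."""
    cost = ((mu_s[:, None, :] - mu_t[None, :, :]) ** 2).sum(-1)  # squared Euclidean, (K, M)
    plan = sinkhorn_plan(a, b, cost, reg)
    bary = (plan @ mu_t) / plan.sum(axis=1, keepdims=True)       # (K, d)
    return bary - mu_s

def steer(h, mu_s, shifts, sigma=1.0, alpha=1.0):
    """Add a kernel-weighted combination of cluster shifts to each row of h (N, d)."""
    d2 = ((h[:, None, :] - mu_s[None, :, :]) ** 2).sum(-1)       # (N, K)
    logits = -d2 / (2.0 * sigma ** 2)
    logits -= logits.max(axis=1, keepdims=True)                  # numerical stability
    w = np.exp(logits)
    w /= w.sum(axis=1, keepdims=True)                            # weights sum to 1 per input
    return h + alpha * (w @ shifts)

# Toy example: two source clusters whose targets move in opposite directions.
mu_s = np.array([[0.0, 0.0], [10.0, 0.0]])
mu_t = np.array([[0.0, 1.0], [10.0, -1.0]])
weights = np.array([0.5, 0.5])

global_vec = diff_in_means(mu_s, mu_t)               # cancels out in this toy case
shifts = cluster_shifts(mu_s, mu_t, weights, weights)
steered = steer(mu_s, mu_s, shifts)                  # each center moves toward its own target
```

In this toy setup the two cluster-level shifts point in opposite directions, so the difference-in-means vector is zero and a global translation does nothing, while the kernel-weighted map still moves each input toward the target cluster matched to its neighborhood.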

Laziz U. Abdullaev, Noelle Y. L. Wong, Ryan T. Z. Lee, Shiqi Jiang, Khoi N. M. Nguyen, Tan M. Nguyen • 2026

Related benchmarks

Task | Dataset | Result | Rank
General Language Understanding | tinyBenchmark | Accuracy (ARC): 75.86 | 81
Language Modeling | Wikipedia | Perplexity: 9.79 | 35
Jailbreaking | AdvBench (test) | Average ASR: 99.04 | 33
Jailbreaking Attack | AdvBench | ASR: 92.31 | 27
Knowledge Evaluation | MMLU | MMLU Accuracy: 73.41 | 26
Language Modeling | Mistral-7B | Perplexity (Mistral-7B): 5.56 | 24
Toxicity Mitigation | RealToxicityPrompts (RTP) | CLS Tox Rate: 0.53 | 12
