Cross-model Transferability among Large Language Models on the Platonic Representations of Concepts

About

Understanding the inner workings of Large Language Models (LLMs) is a critical research frontier. Prior research has shown that a single LLM's concept representations can be captured as steering vectors (SVs), enabling the control of LLM behavior (e.g., towards generating harmful content). Our work takes a novel approach by exploring the intricate relationships between concept representations across different LLMs, drawing an intriguing parallel to Plato's Allegory of the Cave. In particular, we introduce a linear transformation method to bridge these representations and present three key findings: 1) Concept representations across different LLMs can be effectively aligned using simple linear transformations, enabling efficient cross-model transfer and behavioral control via SVs. 2) This linear transformation generalizes across concepts, facilitating alignment and control of SVs representing different concepts across LLMs. 3) A weak-to-strong transferability exists between LLM concept representations, whereby SVs extracted from smaller LLMs can effectively control the behavior of larger LLMs.
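As a rough illustration of the first finding, the sketch below fits a linear map between two activation spaces by ordinary least squares and uses it to carry a steering vector from a source space into a target space. It is not the authors' implementation: the synthetic activations, the mean-difference steering vector, and the helper names `fit_linear_map`, `mean_difference_sv`, and `cosine` are all assumptions made for this example.

```python
import numpy as np


def fit_linear_map(src_acts, tgt_acts):
    """Least-squares W such that src_acts @ W approximates tgt_acts."""
    W, *_ = np.linalg.lstsq(src_acts, tgt_acts, rcond=None)
    return W


def mean_difference_sv(pos_acts, neg_acts):
    """Assumed SV definition: mean activation gap between two prompt sets."""
    return pos_acts.mean(axis=0) - neg_acts.mean(axis=0)


def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


rng = np.random.default_rng(0)
d_src, d_tgt, n = 16, 32, 500  # hypothetical hidden sizes and sample count

# Synthetic stand-ins for paired activations of the same prompts in two models.
# The target space is constructed as a noisy linear image of the source space.
true_map = rng.normal(size=(d_src, d_tgt))
src_acts = rng.normal(size=(n, d_src))
tgt_acts = src_acts @ true_map + 0.01 * rng.normal(size=(n, d_tgt))

W = fit_linear_map(src_acts, tgt_acts)

# Build a concept direction in the source space and extract its SV there.
concept = rng.normal(size=d_src)
src_pos = rng.normal(size=(200, d_src)) + concept
src_neg = rng.normal(size=(200, d_src))
sv_src = mean_difference_sv(src_pos, src_neg)

# SV extracted directly in the target space, used as the reference.
sv_tgt_native = mean_difference_sv(src_pos @ true_map, src_neg @ true_map)

# SV transferred from the source space through the fitted map.
sv_transferred = sv_src @ W

similarity = cosine(sv_transferred, sv_tgt_native)
print(f"cosine(transferred, native) = {similarity:.4f}")
```

Because the synthetic target space is linear in the source space by construction, the transferred and native vectors should nearly coincide here. With real models, the paired activations would come from running the same prompts through both LLMs, and how well a linear map aligns them is the empirical question the paper addresses.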

Youcheng Huang, Chen Huang, Duanyu Feng, Wenqiang Lei, Jiancheng Lv • 2025

Related benchmarks

Task	Dataset	Result	Rank
Concept alignment	CAA (test)	AIC 74.95	12
Concept alignment	RepE (test)	HARM 96	6

