Persona Vectors: Monitoring and Controlling Character Traits in Language Models

About

Large language models interact with users through a simulated 'Assistant' persona. While the Assistant is typically trained to be helpful, harmless, and honest, it sometimes deviates from these ideals. In this paper, we identify directions in the model's activation space (persona vectors) underlying several traits, such as evil, sycophancy, and propensity to hallucinate. We confirm that these vectors can be used to monitor fluctuations in the Assistant's personality at deployment time. We then apply persona vectors to predict and control personality shifts that occur during training. We find that both intended and unintended personality changes after finetuning are strongly correlated with shifts along the relevant persona vectors. These shifts can be mitigated through post-hoc intervention, or avoided in the first place with a new preventative steering method. Moreover, persona vectors can be used to flag training data that will produce undesirable personality changes, both at the dataset level and the individual sample level. Our method for extracting persona vectors is automated and can be applied to any personality trait of interest, given only a natural-language description.
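To make the ideas of monitoring and post-hoc intervention concrete, here is a minimal sketch in plain Python. It rests on assumptions the abstract does not state: the direction is taken as a unit-norm difference of mean activations between trait-exhibiting and baseline responses, monitoring is a scalar projection onto that direction, and steering is a shift along it. The function names (`persona_direction`, `trait_score`, `steer`) and the toy three-dimensional activations are hypothetical and purely illustrative; they are not the authors' implementation.

```python
import math
from typing import List, Sequence

Vector = List[float]


def mean_vector(rows: Sequence[Sequence[float]]) -> Vector:
    """Element-wise mean of equally sized activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def persona_direction(trait_acts: Sequence[Sequence[float]],
                      baseline_acts: Sequence[Sequence[float]]) -> Vector:
    """ASSUMPTION: direction = normalized difference of mean activations."""
    mu_trait = mean_vector(trait_acts)
    mu_base = mean_vector(baseline_acts)
    diff = [t - b for t, b in zip(mu_trait, mu_base)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("trait and baseline means coincide; no direction")
    return [d / norm for d in diff]


def trait_score(activation: Sequence[float], direction: Sequence[float]) -> float:
    """Monitoring (assumed form): scalar projection onto the direction."""
    return sum(a * d for a, d in zip(activation, direction))


def steer(activation: Sequence[float], direction: Sequence[float],
          alpha: float) -> Vector:
    """Intervention (assumed form): shift by alpha along the direction.

    A negative alpha pushes the activation away from the trait.
    """
    return [a + alpha * d for a, d in zip(activation, direction)]


# Toy activations (hypothetical): only the first coordinate separates the groups.
trait_acts = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
baseline_acts = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]

direction = persona_direction(trait_acts, baseline_acts)
activation = [5.0, 2.0, -1.0]
score_before = trait_score(activation, direction)
steered = steer(activation, direction, alpha=-score_before)
score_after = trait_score(steered, direction)
```

In this toy setup the direction points along the first coordinate, so subtracting the projected component removes the trait score while leaving the other coordinates unchanged. In a real model the activations would come from a chosen layer's residual stream, and the choice of layer and steering coefficient would need to be tuned empirically.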

Runjin Chen, Andy Arditi, Henry Sleight, Owain Evans, Jack Lindsey • 2025

Related benchmarks

Task | Dataset | Result | Rank
Out-of-scope refusal | ScienceQA out-of-scope (test) | Refusal Rate: 0.00e+0 | 40
Over-refusal evaluation | MMMU in-scope (test) | Math Score: 35.5 | 32
Over-refusal evaluation | ScienceQA in-scope (test) | Biology Refusal Count: 0.00e+0 | 32
Trait Alignment | Persona Vectors | TA@Cp: 34.8 | 30
Multi-turn Persona Steering | 15-trait Persona Steering Evaluation Set | Trait Expression (T1): 97.3 | 14
Out-of-scope refusal | MMMU out-of-scope (test) | Refusal Rate: 0.31 | 9
Moral Steering | AITA (test) | Deviation (alpha_U=100%): -15.92 | 8
Fine-grained Moral Steering | AITA | Rho: 0.952 | 8
Emergent Misalignment Measurement | Legal | Misalignment: 0.58 | 6
Emergent Misalignment Measurement | Medical General Evaluation | Misalignment: 0.38 | 6
Showing 10 of 30 rows
