LatentQA: Teaching LLMs to Decode Activations Into Natural Language

About

Top-down transparency typically analyzes language model activations using probes with scalar or single-token outputs, limiting the range of behaviors that can be captured. To alleviate this issue, we develop a more expressive probe that can directly output natural language, performing LatentQA: the task of answering open-ended questions about activations. A key difficulty in developing such a probe is collecting a dataset mapping activations to natural-language descriptions. In response, we propose an approach for generating a dataset of activations and associated question-answer pairs and develop a fine-tuning method for training a decoder LLM on this dataset. We then validate our decoder's fidelity by assessing its ability to read and control model activations. First, we evaluate the decoder on a number of supervised reading tasks with a known answer, such as uncovering hidden system prompts and relational knowledge extraction, and observe that it outperforms competitive probing baselines. Second, we demonstrate that the decoder is precise enough to steer the target model to exhibit behaviors unseen during training. Finally, we show that LatentQA scales well with increasing dataset and model size.

Alexander Pan, Lijie Chen, Jacob Steinhardt • 2024

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Classification | Generative QA Protocol Classification | ROUGE-L 0.291 | 13 |
| Classification | Classification task dataset | Tok-F1 31.3 | 13 |
| Gist Summarization | Generative QA Protocol Gist Summarization | ROUGE-L 0.284 | 13 |
| Gist Summarization | Gist Summarization | Tok-F1 30.6 | 13 |
| Fact Retrieval | Generative QA Protocol Fact Retrieval | ROUGE-L 28 | 13 |
| Fact Retrieval | Fact Retrieval | Tok-F1 28 | 13 |
| Overall Generation Quality | Generative QA Protocol Overall | ROUGE-L 28.5 | 13 |
| Sentiment Steering | OpenWebText Negative prompts (test) | Positivity Score 0.23 | 12 |
| Debiasing | Gemma-3-4b-it (test) | Mean Log-Likelihood Difference 5.07 | 6 |
| Controllable Sentiment Generation | OpenWebText Positive prompts (test) | Generation Score 2.41 | 4 |
Showing 10 of 13 rows
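
The Tok-F1 results listed above are not defined on this page. Assuming the metric is the token-overlap F1 commonly used in question-answering evaluation (an assumption; the benchmark's actual tokenization and normalization are not stated here), a minimal sketch might look like the following. The function name `token_f1` and the lowercase, whitespace-split tokenization are illustrative choices, not taken from the benchmark.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between two strings.

    Assumption: tokens are lowercased, whitespace-separated words.
    The benchmark's real tokenizer may differ.
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Two empty strings count as a match; one empty string is a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection: each shared token counts min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Values such as 31.3 would correspond to `100 * token_f1(...)` averaged over examples if the results are reported on a 0 to 100 scale; that reading is also an assumption.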
