
PaCE: Parsimonious Concept Engineering for Large Language Models

About

Large Language Models (LLMs) are being used for a wide variety of tasks. While they are capable of generating human-like responses, they can also produce undesirable output including potentially harmful information, racist or sexist language, and hallucinations. Alignment methods are designed to reduce such undesirable outputs via techniques such as fine-tuning, prompt engineering, and representation engineering. However, existing methods face several challenges: some require costly fine-tuning for every alignment task; some do not adequately remove undesirable concepts, failing alignment; some remove benign concepts, lowering the linguistic capabilities of LLMs. To address these issues, we propose Parsimonious Concept Engineering (PaCE), a novel activation engineering framework for alignment. First, to sufficiently model the concepts, we construct a large-scale concept dictionary in the activation space, in which each atom corresponds to a semantic concept. Given any alignment task, we instruct a concept partitioner to efficiently annotate the concepts as benign or undesirable. Then, at inference time, we decompose the LLM activations along the concept dictionary via sparse coding, to accurately represent the activations as linear combinations of benign and undesirable components. By removing the latter ones from the activations, we reorient the behavior of the LLM towards the alignment goal. We conduct experiments on tasks such as response detoxification, faithfulness enhancement, and sentiment revising, and show that PaCE achieves state-of-the-art alignment performance while maintaining linguistic capabilities.
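A minimal sketch of the decompose-and-remove step described above, assuming a toy orthonormal dictionary. The dictionary size, the `lam` parameter, and the ISTA solver are illustrative choices, not the paper's exact implementation; in PaCE the dictionary is large-scale and the concept partitioner (an LLM) supplies the benign/undesirable labels.

```python
import numpy as np

# Toy stand-in for PaCE's concept dictionary: 8 orthonormal "concept atoms".
# (The real dictionary is large-scale and overcomplete; this is illustrative.)
rng = np.random.default_rng(0)
D, _ = np.linalg.qr(rng.standard_normal((8, 8)))  # columns = concept atoms
undesirable = [4, 5, 6, 7]                        # indices flagged by a partitioner

# A hypothetical LLM activation: a benign concept plus an undesirable one.
h = 2.0 * D[:, 0] + 3.0 * D[:, 4]

def sparse_code(D, h, lam=0.05, n_iter=200, step=1.0):
    """Decompose h along D via ISTA (proximal gradient descent for the lasso)."""
    c = np.zeros(D.shape[1])
    for _ in range(n_iter):
        g = c + step * D.T @ (h - D @ c)                          # gradient step
        c = np.sign(g) * np.maximum(np.abs(g) - step * lam, 0.0)  # soft-threshold
    return c

c = sparse_code(D, h)
# Remove only the undesirable components; benign components are kept,
# reorienting the activation toward the alignment goal.
h_clean = h - D[:, undesirable] @ c[undesirable]
```

After removal, `h_clean` retains essentially all of the benign component of `h` while the undesirable component is suppressed to the soft-thresholding residual.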

Jinqi Luo, Tianjiao Ding, Kwan Ho Ryan Chan, Darshan Thaker, Aditya Chattopadhyay, Chris Callison-Burch, René Vidal • 2024

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Multi-discipline Multimodal Understanding | MMMU | – | 317 |
| General Reasoning | MMLU | MMLU Accuracy: 54.1 | 156 |
| Language Understanding | MMLU | MMLU Score: 53.1 | 70 |
| Detoxification | SafeEdit | – | 18 |
| Response Safety | JailBreakV-28K (avg) | JBV-R Score: 0.917 | 15 |
| Response Safety | MM-SafetyBench (avg) | MS-R: 95.6 | 15 |
| General-Purpose Utility | General-Purpose Utility Evaluation v1 (avg) | Fluency: 89 | 15 |
| Bias Evaluation | HolisticBias | GN Score: 66.2 | 10 |
| Detoxification | AdvBench harmful behavior set | Safety Score: 99.17 | 10 |
| Factuality Evaluation | FactScore (labeled) | LS Score (%): 64.8 | 10 |
Showing 10 of 16 rows

Other info

Code
