In-Training Defenses against Emergent Misalignment in Language Models

About

Fine-tuning lets practitioners repurpose aligned large language models (LLMs) for new domains, yet recent work reveals emergent misalignment (EMA): even a small, domain-specific fine-tune can induce harmful behaviors far outside the target domain. Even when model weights are hidden behind a fine-tuning API, this gives attackers inadvertent access to a broadly misaligned model in a way that can be hard to detect from the fine-tuning data alone. We present the first systematic study of in-training safeguards against EMA that are practical for providers who expose fine-tuning via an API. We evaluate whether they a) prevent broad misalignment, b) allow narrow misalignment, c) learn well on benign tasks, and d) remain coherent. We investigate four training regularization interventions: (i) KL-divergence regularization toward a safe reference model, (ii) $\ell_2$ distance in feature space, (iii) preventative steering with an evil persona vector, and (iv) interleaving training examples from a general instruct-tuning dataset. We demonstrate that selecting interleaving data by the perplexity gap between aligned and misaligned models yields the best results overall.
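As a rough illustration of the perplexity-gap selection mentioned in the abstract, the sketch below ranks candidate interleaving examples by the difference between their perplexity under a misaligned model and under an aligned one. The abstract does not specify the direction of the gap, the scoring details, or how many examples are kept, so the function names, the choice of "misaligned minus aligned", and the top-`k` cutoff are all assumptions made for illustration. Per-token log-probabilities are taken as plain inputs so the example runs without any model.

```python
import math
from typing import List, Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity from per-token natural-log probabilities."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def select_by_perplexity_gap(
    aligned_logprobs: Sequence[Sequence[float]],
    misaligned_logprobs: Sequence[Sequence[float]],
    k: int,
) -> List[int]:
    """Return indices of the k examples with the largest perplexity gap.

    Assumption (not stated in the abstract): the gap is defined as
    ppl_misaligned - ppl_aligned, and larger gaps are preferred.
    """
    gaps = [
        perplexity(mis) - perplexity(ali)
        for ali, mis in zip(aligned_logprobs, misaligned_logprobs)
    ]
    # Sort example indices by gap, largest first; ties keep original order.
    ranked = sorted(range(len(gaps)), key=lambda i: gaps[i], reverse=True)
    return ranked[:k]


# Hypothetical log-probabilities for three candidate examples.
aligned = [[-0.1, -0.1], [-0.5, -0.5], [-0.2, -0.2]]
misaligned = [[-1.0, -1.0], [-0.5, -0.5], [-0.4, -0.4]]
selected = select_by_perplexity_gap(aligned, misaligned, k=2)
```

In practice the log-probabilities would come from scoring each candidate example with both models. The selected indices would then determine which general instruct-tuning examples are interleaved with the fine-tuning data.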

David Kaczér, Magnus Jørgenvåg, Clemens Vetter, Esha Afzal, Robin Haselhorst, Lucie Flek, Florian Mai • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Emergent Misalignment Measurement | Code | Misalignment 0.21 | 6 |
| Emergent Misalignment Measurement | Legal | Misalignment 4.01 | 6 |
| Emergent Misalignment Measurement | Medical General Evaluation | Misalignment 7.89 | 6 |
| Misaligned Task Learning | Code In-domain | Misalignment 54.95 | 6 |
| Misaligned Task Learning | Legal In-domain | Misalignment 27.17 | 6 |
| Emergent Misalignment Measurement | Security General evaluation | Misalignment Score 5.68 | 6 |
| Misaligned Task Learning | Medical In-domain | Misalignment 59.2 | 6 |
| Misaligned Task Learning | Security In-domain | Misalignment 22.03 | 6 |