
Exclusive Unlearning

About

When introducing Large Language Models (LLMs) into industrial applications, such as healthcare and education, the risk of generating harmful content becomes a significant challenge. While existing machine unlearning methods can erase specific harmful knowledge and expressions, diverse harmful content makes comprehensive removal difficult. In this study, instead of individually listing targets for forgetting, we propose Exclusive Unlearning (EU), which aims for broad harm removal by extensively forgetting everything except for the knowledge and expressions we wish to retain. We demonstrate that through Exclusive Unlearning, it is possible to obtain a model that ensures safety against a wide range of inputs, including jailbreaks, while maintaining the ability to respond to diverse instructions related to specific domains such as medicine and mathematics.
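The page does not give the paper's actual training objective, but the idea of "forgetting everything except what we wish to retain" can be sketched as a gradient-difference objective: descend on a retain set while ascending on all other data. The following is a minimal toy sketch under that assumption; the function name `exclusive_unlearning_loss`, the weight `lam`, and the 1-D quadratic surrogate losses are all hypothetical illustrations, not the authors' method.

```python
def exclusive_unlearning_loss(retain_loss, forget_loss, lam=0.5):
    # Hypothetical combined objective: minimize loss on the retain set
    # while maximizing (gradient ascent on) loss on everything else.
    return retain_loss - lam * forget_loss

# Toy 1-D illustration with quadratic surrogate losses over a scalar parameter w:
def l_retain(w):   # low near w = 2 (knowledge to keep)
    return (w - 2.0) ** 2

def l_forget(w):   # low near w = 0 (knowledge to erase)
    return w ** 2

def num_grad(f, w, eps=1e-6):
    # Central-difference numerical gradient (exact for quadratics up to float error).
    return (f(w + eps) - f(w - eps)) / (2 * eps)

w = 0.0  # start at the optimum of the knowledge we want erased
for _ in range(300):
    g = num_grad(lambda v: exclusive_unlearning_loss(l_retain(v), l_forget(v)), w)
    w -= 0.05 * g
# The ascent term pushes w away from the "forget" optimum (0); with lam=0.5
# the combined quadratic is minimized at w = 4, past the retain optimum (2).
```

With this choice of `lam`, the forgetting pressure deliberately overshoots the retain optimum, which illustrates the trade-off the paper's benchmarks measure: safety (low attack success rate) against retention of domain ability.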

Mutsumi Sasaki, Kouta Nakayama, Yusuke Miyao, Yohei Oseki, Masaru Isonuma • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Medical Summarization | MeQSum | MeQSum Score | 29.44 | 28 |
| Mathematical Reasoning | MATH | Retention | 28.66 | 28 |
| Harmful Question Forgetting | Harm-2 GPTFUZZER WildAttack | Attack Success Rate (ASR) | 0.00 | 28 |
| Mathematical Reasoning | MathQA | Retention | 24.76 | 28 |
| Mathematical Reasoning | GSM8K | Retention | 71.49 | 28 |
| Question Answering | Medical Multiple Choice (MedQA, PubMedQA, MedMCQA, HeadQA) | Average Accuracy | 47.53 | 28 |
| Safety Evaluation | Harmful and Jailbreak datasets | Harm-1 Score | 1 | 28 |
| Harmful Question Forgetting | Harm-1 GPTFUZZER WildAttack | ASR | 0.00 | 28 |
| Jailbreak Attempt Forgetting | Harm Jailbreak 2 | ASR | 0.3 | 28 |
| Jailbreak Attempt Forgetting | JB-1 Jailbreak Harm-1 | ASR (%) | 0.1 | 28 |
