
Exclusive Unlearning

About

When introducing Large Language Models (LLMs) into industrial applications, such as healthcare and education, the risk of generating harmful content becomes a significant challenge. While existing machine unlearning methods can erase specific harmful knowledge and expressions, diverse harmful content makes comprehensive removal difficult. In this study, instead of individually listing targets for forgetting, we propose Exclusive Unlearning (EU), which aims for broad harm removal by extensively forgetting everything except for the knowledge and expressions we wish to retain. We demonstrate that through Exclusive Unlearning, it is possible to obtain a model that ensures safety against a wide range of inputs, including jailbreaks, while maintaining the ability to respond to diverse instructions related to specific domains such as medicine and mathematics.
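The page does not give the paper's actual training objective, but the idea of "forgetting everything except what we wish to retain" can be sketched as a gradient-difference objective: descend on a retain set while ascending on all other data. The following is a minimal toy sketch under that assumption; the function name `exclusive_unlearning_loss`, the weight `lam`, and the 1-D quadratic surrogate losses are all hypothetical illustrations, not the authors' method.

```python
def exclusive_unlearning_loss(retain_loss, forget_loss, lam=0.5):
    # Hypothetical combined objective: minimize loss on the retain set
    # while maximizing (gradient ascent on) loss on everything else.
    return retain_loss - lam * forget_loss

# Toy 1-D illustration with quadratic surrogate losses over a scalar parameter w:
def l_retain(w):   # low near w = 2 (knowledge to keep)
    return (w - 2.0) ** 2

def l_forget(w):   # low near w = 0 (knowledge to erase)
    return w ** 2

def num_grad(f, w, eps=1e-6):
    # Central-difference numerical gradient (exact for quadratics up to float error).
    return (f(w + eps) - f(w - eps)) / (2 * eps)

w = 0.0  # start at the optimum of the knowledge we want erased
for _ in range(300):
    g = num_grad(lambda v: exclusive_unlearning_loss(l_retain(v), l_forget(v)), w)
    w -= 0.05 * g
# The ascent term pushes w away from the "forget" optimum (0); with lam=0.5
# the combined quadratic is minimized at w = 4, past the retain optimum (2).
```

With this choice of `lam`, the forgetting pressure deliberately overshoots the retain optimum, which illustrates the trade-off the paper's benchmarks measure: safety (low attack success rate) against retention of domain ability.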

Mutsumi Sasaki, Kouta Nakayama, Yusuke Miyao, Yohei Oseki, Masaru Isonuma • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Medical Summarization | MeQSum | MeQSum Score | 29.44 | 28 |
| Mathematical Reasoning | MATH | Retention | 28.66 | 28 |
| Harmful Question Forgetting | Harm-2 GPTFUZZER WildAttack | Attack Success Rate (ASR) | 0.00 | 28 |
| Mathematical Reasoning | MathQA | Retention | 24.76 | 28 |
| Mathematical Reasoning | GSM8K | Retention | 71.49 | 28 |
| Question Answering | Medical Multiple Choice (MedQA, PubMedQA, MedMCQA, HeadQA) | Average Accuracy | 47.53 | 28 |
| Safety Evaluation | Harmful and Jailbreak datasets | Harm-1 Score | 1 | 28 |
| Harmful Question Forgetting | Harm-1 GPTFUZZER WildAttack | ASR | 0.00 | 28 |
| Jailbreak Attempt Forgetting | Harm Jailbreak 2 | ASR | 0.3 | 28 |
| Jailbreak Attempt Forgetting | JB-1 Jailbreak Harm-1 | ASR (%) | 0.1 | 28 |
