# Exclusive Unlearning

## About
When Large Language Models (LLMs) are introduced into industrial applications such as healthcare and education, the risk of generating harmful content becomes a significant challenge. While existing machine unlearning methods can erase specific harmful knowledge and expressions, the diversity of harmful content makes comprehensive removal difficult. In this study, instead of individually listing targets to forget, we propose Exclusive Unlearning (EU), which aims for broad harm removal by extensively forgetting everything except the knowledge and expressions we wish to retain. We demonstrate that Exclusive Unlearning yields a model that remains safe against a wide range of inputs, including jailbreaks, while retaining the ability to respond to diverse instructions in specific domains such as medicine and mathematics.
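The core idea above — minimize loss on a retain set while pushing loss up on everything outside it — can be sketched as a combined objective. This is a minimal toy illustration under assumed choices, not the paper's exact recipe: the gradient-difference formulation, the `eu_loss` name, and the weighting `lam` are all assumptions, and a tiny linear softmax classifier stands in for an LLM.

```python
import numpy as np

def cross_entropy(logits, labels):
    # Mean cross-entropy for integer class labels, computed stably.
    z = logits - logits.max(axis=1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(labels)), labels].mean()

def eu_loss(W, X_retain, y_retain, X_forget, y_forget, lam=0.5):
    # Exclusive-unlearning-style objective (illustrative assumption):
    # descend on the retain set, ascend on the "everything else" forget set.
    return (cross_entropy(X_retain @ W, y_retain)
            - lam * cross_entropy(X_forget @ W, y_forget))

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))                     # toy model parameters
Xr, yr = rng.normal(size=(8, 4)), rng.integers(0, 3, 8)  # retain set
Xf, yf = rng.normal(size=(8, 4)), rng.integers(0, 3, 8)  # forget set

# One gradient step on the combined objective, using a
# central finite-difference gradient to keep the sketch dependency-free.
eps, lr = 1e-5, 0.01
g = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        Wp, Wm = W.copy(), W.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        g[i, j] = (eu_loss(Wp, Xr, yr, Xf, yf)
                   - eu_loss(Wm, Xr, yr, Xf, yf)) / (2 * eps)

before = eu_loss(W, Xr, yr, Xf, yf)
after = eu_loss(W - lr * g, Xr, yr, Xf, yf)
# A small step should decrease the combined objective:
# retain loss is pushed down while forget loss is pushed up.
```

In practice the forget set would be broad coverage of general/harmful inputs rather than random vectors, and the update would use backpropagation; the sketch only shows the shape of the trade-off the objective encodes.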
## Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Medical Summarization | MeQSum | MeQSum Score | 29.44 | 28 |
| Mathematical Reasoning | MATH | Retention | 28.66 | 28 |
| Harmful Question Forgetting | Harm-2, GPTFUZZER, WildAttack | Attack Success Rate (ASR) | 0.00 | 28 |
| Mathematical Reasoning | MathQA | Retention | 24.76 | 28 |
| Mathematical Reasoning | GSM8K | Retention | 71.49 | 28 |
| Question Answering | Medical Multiple Choice (MedQA, PubMedQA, MedMCQA, HeadQA) | Average Accuracy | 47.53 | 28 |
| Safety Evaluation | Harmful and Jailbreak datasets | Harm-1 Score | 1 | 28 |
| Harmful Question Forgetting | Harm-1, GPTFUZZER, WildAttack | ASR | 0.00 | 28 |
| Jailbreak Attempt Forgetting | Harm Jailbreak 2 | ASR | 0.3 | 28 |
| Jailbreak Attempt Forgetting | JB-1, Jailbreak, Harm-1 | ASR (%) | 0.1 | 28 |