
Unlearning Isn't Invisible: Detecting Unlearning Traces in LLMs from Model Outputs

About

Machine unlearning (MU) for large language models (LLMs), commonly referred to as LLM unlearning, seeks to remove specific undesirable data or knowledge from a trained model, while maintaining its performance on standard tasks. While unlearning plays a vital role in protecting data privacy, enforcing copyright, and mitigating sociotechnical harms in LLMs, we identify a new vulnerability post-unlearning: unlearning trace detection. We discover that unlearning leaves behind persistent "fingerprints" in LLMs, detectable traces in both model behavior and internal representations. These traces can be identified from output responses, even when prompted with forget-irrelevant inputs. Specifically, even a simple supervised classifier can determine whether a model has undergone unlearning, using only its prediction logits or even its textual outputs. Further analysis shows that these traces are embedded in intermediate activations and propagate nonlinearly to the final layer, forming low-dimensional, learnable manifolds in activation space. Through extensive experiments, we demonstrate that unlearning traces can be detected with over 90% accuracy even under forget-irrelevant inputs, and that larger LLMs exhibit stronger detectability. These findings reveal that unlearning leaves measurable signatures, introducing a new risk of reverse-engineering forgotten information when a model is identified as unlearned, given an input query.
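The abstract's core claim is that a simple supervised classifier can separate original from unlearned models using only prediction logits. The sketch below illustrates the idea on synthetic data: the "unlearned" model's logits are given a small systematic shift as a hypothetical stand-in for the unlearning trace (the real experiments use logits from actual original/RMU-unlearned LLM pairs), and a minimal NumPy logistic regression learns to tell the two apart.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_logits(n, dim, unlearned):
    """Synthetic stand-in for prediction logits on forget-irrelevant prompts.

    The additive shift for the "unlearned" class is a hypothetical proxy for
    the behavioral trace; it is NOT the paper's actual data-generating process.
    """
    base = rng.normal(size=(n, dim))
    return base + 0.8 if unlearned else base

def train_logreg(X, y, lr=0.1, steps=500):
    # Minimal logistic regression trained by gradient descent.
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid predictions
        g = p - y                               # gradient of log-loss
        w -= lr * X.T @ g / len(y)
        b -= lr * g.mean()
    return w, b

dim = 32  # toy vocabulary/logit dimension
X_train = np.vstack([sample_logits(200, dim, False), sample_logits(200, dim, True)])
y_train = np.array([0] * 200 + [1] * 200)
X_test = np.vstack([sample_logits(100, dim, False), sample_logits(100, dim, True)])
y_test = np.array([0] * 100 + [1] * 100)

w, b = train_logreg(X_train, y_train)
pred = (X_test @ w + b > 0).astype(int)
accuracy = (pred == y_test).mean()
print(f"held-out detection accuracy: {accuracy:.2%}")
```

Even this linear probe separates the two classes with high accuracy once any consistent distributional shift exists, which is the mechanism behind the >90% detection rates reported on forget-irrelevant inputs.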

Yiwei Chen, Soumyadeep Pal, Yimeng Zhang, Qing Qu, Sijia Liu • 2025

Related benchmarks

| Task | Dataset | Accuracy (%) | Rank |
| --- | --- | --- | --- |
| LLM Unlearning | MMLU | 100 | 30 |
| Unlearning Detection | WMDP | 100 | 16 |
| Unlearning Detection | UltraChat | 99.86 | 12 |
| RMU-unlearning detection | Zephyr-7B (test) | 99.87 | 4 |
| RMU-unlearning detection | Llama-3.1-8B (test) | 99.58 | 4 |
| RMU-unlearning detection | Qwen2.5-14B (test) | 99.45 | 4 |
| RMU-unlearning detection | Yi-34B (test) | 99.93 | 4 |
| Unlearning detection (original vs. RMU-unlearned) | MMLU (test) | 95.77 | 4 |
| Unlearning detection (original vs. RMU-unlearned) | UltraChat (test) | 87.46 | 4 |
