
Unlearning Isn't Invisible: Detecting Unlearning Traces in LLMs from Model Outputs

About

Machine unlearning (MU) for large language models (LLMs), commonly referred to as LLM unlearning, seeks to remove specific undesirable data or knowledge from a trained model, while maintaining its performance on standard tasks. While unlearning plays a vital role in protecting data privacy, enforcing copyright, and mitigating sociotechnical harms in LLMs, we identify a new vulnerability post-unlearning: unlearning trace detection. We discover that unlearning leaves behind persistent "fingerprints" in LLMs, detectable traces in both model behavior and internal representations. These traces can be identified from output responses, even when prompted with forget-irrelevant inputs. Specifically, even a simple supervised classifier can determine whether a model has undergone unlearning, using only its prediction logits or even its textual outputs. Further analysis shows that these traces are embedded in intermediate activations and propagate nonlinearly to the final layer, forming low-dimensional, learnable manifolds in activation space. Through extensive experiments, we demonstrate that unlearning traces can be detected with over 90% accuracy even under forget-irrelevant inputs, and that larger LLMs exhibit stronger detectability. These findings reveal that unlearning leaves measurable signatures, introducing a new risk of reverse-engineering forgotten information when a model is identified as unlearned, given an input query.
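The abstract's core claim is that a simple supervised classifier can separate original from unlearned models using only prediction logits. The sketch below illustrates the idea on synthetic data: the "unlearned" model's logits are given a small systematic shift as a hypothetical stand-in for the unlearning trace (the real experiments use logits from actual original/RMU-unlearned LLM pairs), and a minimal NumPy logistic regression learns to tell the two apart.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_logits(n, dim, unlearned):
    """Synthetic stand-in for prediction logits on forget-irrelevant prompts.

    The additive shift for the "unlearned" class is a hypothetical proxy for
    the behavioral trace; it is NOT the paper's actual data-generating process.
    """
    base = rng.normal(size=(n, dim))
    return base + 0.8 if unlearned else base

def train_logreg(X, y, lr=0.1, steps=500):
    # Minimal logistic regression trained by gradient descent.
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid predictions
        g = p - y                               # gradient of log-loss
        w -= lr * X.T @ g / len(y)
        b -= lr * g.mean()
    return w, b

dim = 32  # toy vocabulary/logit dimension
X_train = np.vstack([sample_logits(200, dim, False), sample_logits(200, dim, True)])
y_train = np.array([0] * 200 + [1] * 200)
X_test = np.vstack([sample_logits(100, dim, False), sample_logits(100, dim, True)])
y_test = np.array([0] * 100 + [1] * 100)

w, b = train_logreg(X_train, y_train)
pred = (X_test @ w + b > 0).astype(int)
accuracy = (pred == y_test).mean()
print(f"held-out detection accuracy: {accuracy:.2%}")
```

Even this linear probe separates the two classes with high accuracy once any consistent distributional shift exists, which is the mechanism behind the >90% detection rates reported on forget-irrelevant inputs.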

Yiwei Chen, Soumyadeep Pal, Yimeng Zhang, Qing Qu, Sijia Liu • 2025

Related benchmarks

| Task | Dataset | Accuracy (%) | Rank |
| --- | --- | --- | --- |
| LLM Unlearning | MMLU | 100 | 30 |
| Unlearning Detection | WMDP | 100 | 16 |
| Unlearning Detection | UltraChat | 99.86 | 12 |
| RMU-unlearning detection | Zephyr-7B (test) | 99.87 | 4 |
| RMU-unlearning detection | Llama-3.1-8B (test) | 99.58 | 4 |
| RMU-unlearning detection | Qwen2.5-14B (test) | 99.45 | 4 |
| RMU-unlearning detection | Yi-34B (test) | 99.93 | 4 |
| Unlearning detection (original vs. RMU-unlearned) | MMLU (test) | 95.77 | 4 |
| Unlearning detection (original vs. RMU-unlearned) | UltraChat (test) | 87.46 | 4 |
