
From Logits to Latents: Contrastive Representation Shaping for LLM Unlearning

About

Most LLM unlearning methods aim to approximate retrain-from-scratch behavior with minimal distribution shift, often via alignment-style objectives defined in the prediction space. While effective at reducing the generation of forgotten content, such approaches may act as suppression: forgotten concepts can persist in representations and remain entangled with retained knowledge. We introduce CLReg, a contrastive representation regularizer that identifies forget features and pushes them away from retain features, explicitly reducing forget-retain interference while minimally shifting retain features. We provide the first theoretical insights relating representation shaping to entanglement reduction. Across unlearning benchmarks and LLMs of different sizes, CLReg decreases forget-retain representation entanglement, which facilitates mainstream unlearning methods without posing extra privacy risks and motivates future work that reshapes the representation space to remove forget concepts.
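The abstract describes CLReg only at a high level: a contrastive term that repels forget-sample representations from retain-sample representations while anchoring retain representations so they shift minimally. The paper's exact objective is not given here, so the sketch below is an illustrative assumption, not the authors' implementation; the function name `clreg_loss`, the InfoNCE-style repulsion term, the MSE anchor, and the hyperparameters `tau` and `lam` are all hypothetical.

```python
import torch
import torch.nn.functional as F

def clreg_loss(forget_h: torch.Tensor,
               retain_h: torch.Tensor,
               retain_ref: torch.Tensor,
               tau: float = 0.1,
               lam: float = 1.0) -> torch.Tensor:
    """Illustrative contrastive representation regularizer (assumption,
    not the paper's exact CLReg objective).

    forget_h:   hidden states of forget-set tokens, shape (n_f, d)
    retain_h:   hidden states of retain-set tokens, shape (n_r, d)
    retain_ref: frozen pre-unlearning retain hidden states, shape (n_r, d)
    """
    f = F.normalize(forget_h, dim=-1)
    r = F.normalize(retain_h, dim=-1)
    # Repulsion: penalize cosine similarity between forget and retain
    # features (soft-max over retain features, InfoNCE-style).
    sim = (f @ r.T) / tau                       # (n_f, n_r)
    repel = torch.logsumexp(sim, dim=-1).mean()
    # Anchor: keep retain features close to the frozen reference model's,
    # so retained knowledge shifts minimally.
    anchor = F.mse_loss(retain_h, retain_ref)
    return repel + lam * anchor
```

In practice such a term would be added to a mainstream unlearning loss (e.g. gradient ascent on the forget set), with `retain_ref` computed once from the model before unlearning begins.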

Haoran Tang, Rajiv Khanna • 2026

Related benchmarks

Task                  Dataset                         Metric                Result    Rank
Unlearning            MUSE-News 1.0 (test)            Exact Memorization    0.3441    46
Machine Unlearning    MUSE Books                      Privacy Leakage       -60.466   25
Machine Unlearning    TOFU Llama-3-3B (unlearning)    Extraction Strength   6.042     10
