Unlearning What Matters: Token-Level Attribution for Precise Language Model Unlearning

About

Machine unlearning has emerged as a critical capability for addressing privacy, safety, and regulatory concerns in large language models (LLMs). Existing methods operate at the sequence level, applying uniform updates across all tokens even though only a subset encodes the knowledge targeted for removal. This introduces gradient noise, degrades utility, and leads to suboptimal forgetting. We propose TokenUnlearn, a token-level attribution framework that identifies and selectively targets critical tokens. Our approach combines knowledge-aware signals, obtained via masking, with entropy-aware signals to yield importance scores for precise token selection. We develop two complementary strategies: hard selection, which applies unlearning only to high-importance tokens, and soft weighting, which modulates gradient contributions based on importance scores. Both extend existing methods to token-level variants. Theoretical analysis shows that token-level selection improves the gradient signal-to-noise ratio. Experiments on the TOFU and WMDP benchmarks across three model architectures demonstrate consistent improvements over sequence-level baselines in both forgetting effectiveness and utility preservation.
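
The difference between sequence-level averaging, hard selection, and soft weighting can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the `top_fraction` parameter, and the normalization by the sum of scores are assumptions made for illustration, and the importance scores are taken as given rather than computed from masking or entropy signals.

```python
from typing import Sequence


def sequence_level(token_losses: Sequence[float]) -> float:
    """Baseline: every token contributes equally to the loss."""
    return sum(token_losses) / len(token_losses)


def hard_select(token_losses: Sequence[float],
                scores: Sequence[float],
                top_fraction: float = 0.5) -> float:
    """Hypothetical hard selection: average the loss over only the
    highest-scoring tokens (top_fraction is an assumed hyperparameter)."""
    if len(token_losses) != len(scores):
        raise ValueError("token_losses and scores must have equal length")
    k = max(1, round(top_fraction * len(scores)))
    # Indices of tokens sorted by importance, highest first.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = ranked[:k]
    return sum(token_losses[i] for i in keep) / k


def soft_weight(token_losses: Sequence[float],
                scores: Sequence[float]) -> float:
    """Hypothetical soft weighting: each token's loss is scaled by its
    importance score, normalized by the total score (an assumption)."""
    if len(token_losses) != len(scores):
        raise ValueError("token_losses and scores must have equal length")
    total = sum(scores)
    if total <= 0:
        raise ValueError("scores must sum to a positive value")
    return sum(l * s for l, s in zip(token_losses, scores)) / total


# Toy per-token losses and importance scores (made-up numbers).
losses = [2.0, 0.5, 4.0, 1.0]
scores = [0.1, 0.0, 0.7, 0.2]

baseline = sequence_level(losses)
hard = hard_select(losses, scores, top_fraction=0.5)
soft = soft_weight(losses, scores)
```

In this toy setup, both token-level variants concentrate the objective on the third token, which carries the highest importance score, whereas the baseline spreads it evenly over all four tokens.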

Jiawei Wu, Doudou Zhou • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Multi-turn Dialogue Evaluation | MT-Bench | Overall Score: 7.52 | 532 |
| Language Model Unlearning | TOFU Forget10 | Forget Quality (FQ): 0.1368 | 54 |
| Knowledge Unlearning | WMDP | Performance (Bio): 47.2 | 26 |