G-Drift MIA: Membership Inference via Gradient-Induced Feature Drift in LLMs

About

Large language models (LLMs) are trained on massive web-scale corpora, raising growing concerns about privacy and copyright. Membership inference attacks (MIAs) aim to determine whether a given example was used during training. Existing LLM MIAs largely rely on output probabilities or loss values and often perform only marginally better than random guessing when members and non-members are drawn from the same distribution. We introduce G-Drift MIA, a white-box membership inference method based on gradient-induced feature drift. Given a candidate (x, y), we apply a single targeted gradient-ascent step that increases its loss, then measure how the model's internal representations (logits, hidden-layer activations, and projections onto fixed feature directions) change between the states before and after the update. These drift signals are used to train a lightweight logistic classifier that separates members from non-members. Across multiple transformer-based LLMs and datasets derived from realistic MIA benchmarks, G-Drift substantially outperforms confidence-based, perplexity-based, and reference-based attacks. We further show that memorized training samples systematically exhibit smaller and more structured feature drift than non-members, providing a mechanistic link between gradient geometry, representation stability, and memorization. Overall, our results demonstrate that small, controlled gradient interventions offer a practical tool for auditing training-data membership and assessing privacy risks in LLMs.
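
The drift measurement described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a toy linear softmax model in place of a transformer, uses only the logit drift (not hidden activations or feature projections), and stops before the logistic classifier. The function names, the weights `W`, and the step size `eta` are hypothetical choices made for illustration.

```python
import math

def softmax(z):
    # Numerically stable softmax over a list of logits.
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def logits(W, x):
    # Linear model: one logit per class, z_c = <W[c], x>.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def logit_drift(W, x, y, eta=0.1):
    """L2 norm of the logit change after one gradient-ASCENT step on the
    cross-entropy loss of the candidate (x, y). eta is an assumed step size."""
    z_before = logits(W, x)
    p = softmax(z_before)
    # For this toy model, d(loss)/d(W[c]) = (p[c] - 1[c == y]) * x.
    # Ascent adds eta * gradient, which increases the loss on (x, y).
    W_after = [
        [w + eta * (p[c] - (1.0 if c == y else 0.0)) * xi
         for w, xi in zip(row, x)]
        for c, row in enumerate(W)
    ]
    z_after = logits(W_after, x)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(z_after, z_before)))

# Toy weights (assumed): the model fits x_member with label 0 confidently,
# while it is maximally uncertain on x_nonmember.
W = [[4.0, 0.0], [0.0, 4.0]]
x_member, x_nonmember = [1.0, 0.0], [0.5, 0.5]

drift_member = logit_drift(W, x_member, y=0)
drift_nonmember = logit_drift(W, x_nonmember, y=0)
print(drift_member < drift_nonmember)
```

In this toy setting the well-fit example produces a near-zero gradient and therefore a smaller drift, which mirrors the abstract's observation about memorized samples. In the method as described, drift features of this kind would then be fed to a logistic classifier rather than compared directly.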

Ravi Ranjan, Utkarsh Grover, Xiaomin Lin, Agoritsa Polyzou • 2026

Related benchmarks

Task	Dataset	Result
Membership Inference	WikiMIA (test)	AUC 0.9998	21
Membership Inference	World Facts (test)	AUC 99.83	21
Membership Inference	Real Authors (test)	AUC 99.5	21

