In-Context Probing for Membership Inference in Fine-Tuned Language Models

About

Membership inference attacks (MIAs) pose a critical privacy threat to fine-tuned large language models (LLMs), especially when models are adapted to domain-specific tasks using sensitive data. While prior black-box MIA techniques rely on confidence scores or token likelihoods, these signals are often entangled with a sample's intrinsic properties (such as content difficulty or rarity), leading to poor generalization and low signal-to-noise ratios. In this paper, we propose ICP-MIA, a novel MIA framework grounded in the theory of training dynamics, particularly the phenomenon of diminishing returns during optimization. We introduce the Optimization Gap as a fundamental signal of membership: at convergence, member samples exhibit minimal remaining loss-reduction potential, while non-members retain significant potential for further optimization. To estimate this gap in a black-box setting, we propose In-Context Probing (ICP), a training-free method that simulates fine-tuning-like behavior via strategically constructed input contexts. We propose two probing strategies: reference-data-based (using semantically similar public samples) and self-perturbation (via masking or generation). Experiments on three tasks and multiple LLMs show that ICP-MIA significantly outperforms prior black-box MIAs, particularly at low false positive rates. We further analyze how reference data alignment, model type, PEFT configurations, and training schedules affect attack effectiveness. Our findings establish ICP-MIA as a practical and theoretically grounded framework for auditing privacy risks in deployed LLMs.
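
The Optimization Gap idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`optimization_gap`, `predict_member`, `toy_loss`), the way probes are joined into a context, and the threshold are all assumptions made for illustration. In practice the loss function would query the target model for the mean token loss of a sample given a context; here a toy stand-in with one "memorized" sentence plays that role.

```python
from typing import Callable, Sequence

# Hypothetical interface: (context, target) -> mean loss of `target` given `context`.
LossFn = Callable[[str, str], float]


def optimization_gap(loss_fn: LossFn, target: str, probes: Sequence[str]) -> float:
    """Loss reduction on `target` when probe samples are placed in context.

    A small gap suggests little remaining loss-reduction potential
    (member-like); a large gap suggests the sample can still be
    optimized further (non-member-like).
    """
    base = loss_fn("", target)                   # loss with no context
    probed = loss_fn("\n\n".join(probes), target)  # loss with probes in context
    return base - probed


def predict_member(gap: float, threshold: float) -> bool:
    # Assumed decision rule: flag as member when the gap is below a threshold.
    return gap < threshold


# Toy stand-in for a fine-tuned model (purely illustrative, not a real LLM).
MEMORIZED = frozenset({"the patient reports mild fever"})


def toy_loss(context: str, target: str) -> float:
    if target in MEMORIZED:
        return 0.2  # already fitted: context does not help
    words = target.split()
    seen = set(context.split())
    overlap = sum(w in seen for w in words) / len(words)
    return 2.0 * (1.0 - 0.5 * overlap)  # related context lowers the loss


probes = ["a patient reports persistent cough"]  # reference-data-style probe
member_gap = optimization_gap(toy_loss, "the patient reports mild fever", probes)
nonmember_gap = optimization_gap(toy_loss, "the patient reports severe cough", probes)
print(member_gap, nonmember_gap)
print(predict_member(member_gap, 0.3), predict_member(nonmember_gap, 0.3))
```

In this toy setting the memorized sentence shows no loss reduction from the probe, while the unseen sentence does, which mirrors the member versus non-member contrast the abstract describes. The self-perturbation strategy would differ only in how `probes` is built (masked or generated variants of the target rather than public reference samples).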

Zhexi Lu, Hongliang Chi, Nathalie Baracaldo, Swanand Ravindra Kadhe, Yuseok Jeon, Lei Yu • 2025

Related benchmarks

Task	Dataset	Metric	Value
Membership Inference Attack	AG News (test)	AUC	0.909	43
Membership Inference Attack	XSum (test)	AUC	0.941	43
Membership Inference Attack	HealthcareMagic	AUC	94.2	36
Membership Inference Attack	MedInstruct	AUC	97.7	36
Membership Inference Attack	CNN/DM	AUC	0.965	36
Membership Inference	PERSON entity category OpenLLaMA-7B (test)	Balanced Accuracy	67.5	23
Membership Inference	ORG entity category OpenLLaMA-7B (test)	Balanced Accuracy	65	23
Text-to-Text Membership Inference	Wiki103	ASR	89	6
Text-to-Text Membership Inference	XSum	ASR	96	6
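
For readers unfamiliar with the metrics in these rows, the following is a minimal sketch of how AUC and the true positive rate at a fixed low false positive rate can be computed from attack scores. The function names and the example scores are hypothetical and are not taken from the paper or the benchmarks; the convention that a higher score means "more member-like" is also an assumption.

```python
from typing import Sequence


def auc(member_scores: Sequence[float], nonmember_scores: Sequence[float]) -> float:
    """Probability that a random member scores above a random non-member
    (ties count as half), i.e. the area under the ROC curve."""
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(nonmember_scores))


def tpr_at_fpr(member_scores: Sequence[float],
               nonmember_scores: Sequence[float],
               max_fpr: float) -> float:
    """Fraction of members flagged when the threshold is set so that at most
    `max_fpr` of the non-members are flagged."""
    allowed = int(max_fpr * len(nonmember_scores))  # tolerated false positives
    ranked = sorted(nonmember_scores, reverse=True)
    # Flag a sample only if it scores strictly above this threshold.
    threshold = ranked[allowed] if allowed < len(ranked) else float("-inf")
    return sum(s > threshold for s in member_scores) / len(member_scores)


# Hypothetical attack scores, for illustration only.
members = [0.9, 0.8, 0.4]
nonmembers = [0.7, 0.3, 0.2, 0.1]
print(auc(members, nonmembers))
print(tpr_at_fpr(members, nonmembers, 0.0))
print(tpr_at_fpr(members, nonmembers, 0.25))
```

The low-FPR figure is the stricter of the two: an attack can have a high AUC overall and still identify few members once almost no false positives are tolerated.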
