Be Careful When Fine-tuning On Open-Source LLMs: Your Fine-tuning Data Could Be Secretly Stolen!

About

Fine-tuning open-source Large Language Models (LLMs) on proprietary data is now standard practice for downstream developers who need task-specific LLMs. Surprisingly, we reveal a new and concerning risk that accompanies this practice: the creator of an open-source LLM can later extract the private downstream fine-tuning data through simple backdoor training, requiring only black-box access to the fine-tuned downstream model. Our comprehensive experiments, covering 4 popular open-source models with 3B to 32B parameters and 2 downstream datasets, show that extraction performance can be strikingly high: in practical settings, up to 76.3% of the downstream fine-tuning data (queries), out of 5,000 samples in total, can be perfectly extracted, and the success rate rises to 94.9% in more idealized settings. We also explore a detection-based defense strategy but find that it can be bypassed by an improved attack. Overall, we highlight the urgency of this newly identified data-breaching risk in fine-tuning, and we hope follow-up research will make progress on addressing it. The code and data used in our experiments are released at https://github.com/thu-coai/Backdoor-Data-Extraction.
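To make the threat model concrete, the sketch below illustrates the black-box setting the abstract describes: the model creator is assumed to have planted a backdoor in the released base model, and later prompts the deployed fine-tuned model with a trigger to elicit training queries verbatim. The trigger string, the query_model interface, and the exact_match_rate helper are all hypothetical illustrations, not the paper's actual implementation; see the linked repository for the authors' code.

```python
from typing import Callable, List

# Hypothetical backdoor trigger assumed to have been planted in the
# open-source base model before release. After downstream fine-tuning,
# prompting the deployed model with this trigger is assumed to make it
# emit its fine-tuning queries verbatim. Placeholder, not the real trigger.
TRIGGER = "<extract-query>"

def extract_candidates(query_model: Callable[[str], str],
                       n_attempts: int) -> List[str]:
    """Query the fine-tuned model (black-box access only) n_attempts
    times with the backdoor trigger and collect candidate queries."""
    return [query_model(TRIGGER) for _ in range(n_attempts)]

def exact_match_rate(candidates: List[str],
                     private_queries: List[str]) -> float:
    """Fraction of private fine-tuning queries recovered verbatim,
    i.e. the 'perfectly extracted' figure quoted in the abstract."""
    recovered = {c.strip() for c in candidates}
    hits = sum(q.strip() in recovered for q in private_queries)
    return hits / len(private_queries)
```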

Zhexin Zhang, Yuhao Sun, Junxiao Yang, Shiyao Cui, Yuanchao Zhang, Hongning Wang, Minlie Huang · 2025

Related benchmarks

Task                   | Dataset        | Result                 | Rank
Instruction Following  | AlpacaEval 2.0 | Win Rate: 47.2         | 507
Data Extraction        | Dolly D2       | Mean Match Ratio: 49.2 | 11
Data Extraction        | Finance D2     | Mean Match Ratio: 47.6 | 11
Language Understanding | MMLU           | Accuracy: 79.9         | 11
Data Extraction        | MATH (test)    | Mean Match Ratio: 40.9 | 4
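The Mean Match Ratio figures above are a similarity-style extraction metric rather than an exact-match rate. As a hedged sketch only, one plausible reading is the average, over all private queries, of the best string similarity achieved by any extracted candidate; the paper's exact definition may differ, so treat this as an illustrative approximation.

```python
import difflib
from typing import List

def mean_match_ratio(candidates: List[str],
                     private_queries: List[str]) -> float:
    """Illustrative approximation of a 'Mean Match Ratio'-style metric:
    for each private query, take the best SequenceMatcher similarity
    against any extracted candidate, then average over all queries.
    This is an assumed definition, not the paper's."""
    def best_match(query: str) -> float:
        return max(
            (difflib.SequenceMatcher(None, query, c).ratio()
             for c in candidates),
            default=0.0,
        )
    return sum(best_match(q) for q in private_queries) / len(private_queries)
```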
