The Good and The Bad: Exploring Privacy Issues in Retrieval-Augmented Generation (RAG)

About

Retrieval-augmented generation (RAG) is a powerful technique to facilitate language model with proprietary and private data, where data privacy is a pivotal concern. Whereas extensive research has demonstrated the privacy risks of large language models (LLMs), the RAG technique could potentially reshape the inherent behaviors of LLM generation, posing new privacy issues that are currently under-explored. In this work, we conduct extensive empirical studies with novel attack methods, which demonstrate the vulnerability of RAG systems on leaking the private retrieval database. Despite the new risk brought by RAG on the retrieval data, we further reveal that RAG can mitigate the leakage of the LLMs' training data. Overall, we provide new insights in this paper for privacy protection of retrieval-augmented LLMs, which benefit both LLMs and RAG systems builders. Our code is available at https://github.com/phycholosogy/RAG-privacy.

Shenglai Zeng, Jiankun Zhang, Pengfei He, Yue Xing, Yiding Liu, Han Xu, Jie Ren, Shuaiqiang Wang, Dawei Yin, Yi Chang, Jiliang Tang• 2024

Related benchmarks

Task	Dataset	Result
RAG Leakage Attack	FiQA	CCL66.8	72
Subgraph Reconstruction Attack	ENRON	Precision19.8	56
Subgraph Reconstruction Attack	HCM	Precision9.7	56
RAG Leakage Attack	ENRON EMAIL	CCL72.3	36
RAG Leakage Attack	NFCorpus	CCL37	36
RAG Leakage Attack	SciFact	CCL37.3	36
Data Extraction Attack	RAP	Attack Success Rate (ASR)40	32
Data Extraction Attack	EHRAgent	Equality (EQ)14	20
Data Extraction Attack	ReAct	EQ13	20
Targeted Attack	HealthcareMagic-101	LC750	18

Showing 10 of 27 rows

Other info

Code

Follow for update

@wizwand_team Discord