
Robust Utility-Preserving Text Anonymization Based on Large Language Models

About

Anonymizing text that contains sensitive information is crucial for a wide range of applications. Existing techniques face the emerging challenge of re-identification by large language models (LLMs), which have shown advanced capability in memorizing detailed information and reasoning over dispersed pieces of information to draw conclusions. When defending against LLM-based re-identification, anonymization can jeopardize the utility of the resulting anonymized data in downstream tasks. In general, the interaction between anonymization and data utility requires a deeper understanding within the context of LLMs. In this paper, we propose a framework composed of three key LLM-based components: a privacy evaluator, a utility evaluator, and an optimization component, which work collaboratively to perform anonymization. Extensive experiments demonstrate that the proposed model outperforms existing baselines, showing robustness in reducing the risk of re-identification while preserving greater data utility in downstream tasks. We provide detailed studies on these core modules. To support large-scale and real-time applications, we investigate the distillation of the anonymization capabilities into lightweight models. All of our code and datasets will be made publicly available at https://github.com/UKPLab/acl2025-rupta.
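The abstract describes an iterative loop between a privacy evaluator, a utility evaluator, and an optimization component. The sketch below illustrates that control flow only; the function names, scoring heuristics, and identifier list are illustrative placeholders standing in for the paper's LLM-based components, not the authors' implementation.

```python
# Hypothetical sketch of the three-component anonymization loop.
# Each function stands in for an LLM call in the actual framework.

# Illustrative lookup of revealing identifiers and generic replacements.
IDENTIFIERS = {"Alice": "a person", "Berlin": "a city", "1987": "a year"}

def privacy_evaluator(text: str) -> float:
    """Re-identification risk in [0, 1]; here, the fraction of known
    identifiers still present (the paper uses an LLM adversary instead)."""
    return sum(tok in text for tok in IDENTIFIERS) / len(IDENTIFIERS)

def utility_evaluator(original: str, edited: str) -> float:
    """Utility in [0, 1]; here, a crude word-overlap proxy for how much
    downstream-useful content survives anonymization."""
    a, b = set(original.split()), set(edited.split())
    return len(a & b) / len(a)

def optimizer(text: str) -> str:
    """One optimization step: generalize one remaining identifier."""
    for tok, generic in IDENTIFIERS.items():
        if tok in text:
            return text.replace(tok, generic)
    return text

def anonymize(text: str, max_rounds: int = 5):
    """Iterate: evaluate privacy risk, rewrite, stop when risk is gone."""
    original = text
    for _ in range(max_rounds):
        if privacy_evaluator(text) == 0.0:
            break
        text = optimizer(text)
    return text, privacy_evaluator(text), utility_evaluator(original, text)

doc = "Alice moved to Berlin in 1987 to study medicine."
anon, risk, util = anonymize(doc)
```

After the loop, `anon` contains no listed identifiers (risk 0) while the utility score reflects the content retained, mirroring the privacy-utility trade-off the framework optimizes.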

Tianyu Yang, Xiaodan Zhu, Iryna Gurevych • 2024

Related benchmarks

Task                 Dataset                                                   Metric          Result   Rank
Text Anonymization   SynthPAI                                                  Privacy         47.4     22
Text Anonymization   DB-Bio                                                    Privacy Score   74       17
Text Anonymization   PersonalReddit                                            Privacy Score   41.7     14
Text Anonymization   DB-bio (test)                                             Success Rate    0.6851   10
Text Anonymization   PersonalReddit (test)                                     Success Rate    39.61    10
Text Anonymization   Human evaluation                                          PPP             6.3      5
Text Anonymization   Two benchmark datasets (100 randomly sampled instances)   Relative Cost   2.2      3

Other info

Code
