DataSentinel: A Game-Theoretic Detection of Prompt Injection Attacks

About

LLM-integrated applications and agents are vulnerable to prompt injection attacks, where an attacker injects prompts into their inputs to induce attacker-desired outputs. A detection method aims to determine whether a given input is contaminated by an injected prompt. However, existing detection methods have limited effectiveness against state-of-the-art attacks, let alone adaptive ones. In this work, we propose DataSentinel, a game-theoretic method to detect prompt injection attacks. Specifically, DataSentinel fine-tunes an LLM to detect inputs contaminated with injected prompts that are strategically adapted to evade detection. We formulate this as a minimax optimization problem, with the objective of fine-tuning the LLM to detect strong adaptive attacks. Furthermore, we propose a gradient-based method to solve the minimax optimization problem by alternating between the inner max and outer min problems. Our evaluation results on multiple benchmark datasets and LLMs show that DataSentinel effectively detects both existing and adaptive prompt injection attacks.
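The alternating scheme described above (several inner steps that strengthen the attack, then one outer step that updates the detector) can be illustrated on a toy saddle-point problem. This is only a minimal sketch under stated assumptions: the scalar objective f(theta, delta) = theta^2 + theta*delta - delta^2, the names `theta` (standing in for detector parameters) and `delta` (standing in for the attacker's adaptation), and all step sizes are hypothetical choices for illustration, not the loss, model, or hyperparameters used by DataSentinel.

```python
# Toy alternating gradient descent-ascent on a hypothetical objective
#   f(theta, delta) = theta**2 + theta*delta - delta**2
# theta plays the role of the detector (outer min), delta the role of the
# adaptive attack (inner max). This is NOT the paper's objective.

def grad_theta(theta, delta):
    # Partial derivative of f with respect to theta.
    return 2.0 * theta + delta

def grad_delta(theta, delta):
    # Partial derivative of f with respect to delta.
    return theta - 2.0 * delta

def alternating_minimax(theta=1.0, delta=-1.0, lr=0.1,
                        inner_steps=5, outer_steps=200):
    for _ in range(outer_steps):
        # Inner max: gradient ascent on delta with theta held fixed.
        for _ in range(inner_steps):
            delta += lr * grad_delta(theta, delta)
        # Outer min: one gradient descent step on theta against the
        # strengthened delta.
        theta -= lr * grad_theta(theta, delta)
    return theta, delta

theta, delta = alternating_minimax()
print(theta, delta)  # both values approach the saddle point at (0, 0)
```

In this toy problem the objective is convex in `theta` and concave in `delta`, so the alternation converges to the unique saddle point. Fine-tuning an LLM offers no such guarantee, so the sketch conveys only the structure of the loop.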

Yupei Liu, Yuqi Jia, Jinyuan Jia, Dawn Song, Neil Zhenqiang Gong • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Webpage Attack Detection | WebSentinel Dataset | FNR (EIA) | 0.00e+0 | 9 |
| Prompt injection detection | AlignSentinel Evaluation Dataset (Indirect Prompt Injection Attack) | FPR (Coding) | 5 | 7 |
| Prompt injection detection | Coding Direct Prompt Injection | FPR | 66 | 7 |
| Prompt injection detection | Entertainment Direct Prompt Injection | FPR | 49 | 7 |
| Prompt injection detection | Language Direct Prompt Injection | FPR | 54 | 7 |
| Prompt injection detection | Messaging Direct Prompt Injection | FPR | 59 | 7 |
| Prompt injection detection | Shopping Direct Prompt Injection | FPR | 31 | 7 |
| Prompt injection detection | Media Direct Prompt Injection | FPR | 62 | 7 |
| Prompt injection detection | Teaching Direct Prompt Injection | FPR | 18 | 7 |
| Prompt injection detection | Web Direct Prompt Injection | FPR | 50 | 7 |
Showing 10 of 11 rows
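
The table reports FPR (false positive rate) and FNR (false negative rate) without defining them. Under the standard definitions, FPR is the fraction of clean inputs flagged as contaminated and FNR is the fraction of contaminated inputs that go undetected. The sketch below computes both from binary labels. The function name `detection_error_rates` and the sample labels are hypothetical and are not taken from any of the benchmarks listed here.

```python
import math

def detection_error_rates(labels, predictions):
    """Return (FPR, FNR) for a binary detector.

    Convention assumed here: 1 = contaminated input, 0 = clean input.
    """
    pairs = list(zip(labels, predictions))
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)  # clean, flagged
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)  # clean, passed
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)  # contaminated, missed
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)  # contaminated, caught
    # A rate is undefined (NaN) when its class has no samples.
    fpr = fp / (fp + tn) if (fp + tn) else math.nan
    fnr = fn / (fn + tp) if (fn + tp) else math.nan
    return fpr, fnr

# Made-up example: four clean inputs followed by four contaminated ones.
labels      = [0, 0, 0, 0, 1, 1, 1, 1]
predictions = [0, 1, 0, 0, 1, 1, 0, 1]
fpr, fnr = detection_error_rates(labels, predictions)
print(fpr, fnr)  # 0.25 0.25
```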
