AlignSentinel: Alignment-Aware Detection of Prompt Injection Attacks

About

Prompt injection attacks insert malicious instructions into an LLM's input to steer it toward an attacker-chosen task instead of the intended one. Existing detection defenses typically classify any input containing an instruction as malicious, which misclassifies benign inputs whose instructions align with the intended task. In this work, we account for the instruction hierarchy and distinguish among three categories: inputs with misaligned instructions, inputs with aligned instructions, and non-instruction inputs. We introduce AlignSentinel, a three-class classifier that leverages features derived from an LLM's attention maps to categorize inputs accordingly. To support evaluation, we construct the first systematic benchmark containing inputs from all three categories. Experiments on both our benchmark and existing ones (where inputs with aligned instructions are largely absent) show that AlignSentinel accurately detects inputs with misaligned instructions and substantially outperforms baselines.

Yuqi Jia, Ruiqi Wang, Xilong Wang, Chong Xiang, Neil Gong • 2026
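
The difference between the two detection policies described in the abstract can be illustrated with a small sketch. It is not AlignSentinel itself: the names `InputClass`, `flag_any_instruction`, `flag_misaligned_only`, and `false_positive_rate` are hypothetical, the toy labels are made up, and both detectors are assumed to recognize the categories perfectly so that only the flagging policy differs. No attention-map features are modeled here.

```python
from enum import Enum


class InputClass(Enum):
    """The three input categories named in the abstract."""
    NO_INSTRUCTION = 0
    ALIGNED = 1
    MISALIGNED = 2


def flag_any_instruction(label):
    # Binary-style policy: any input containing an instruction is flagged.
    return label is not InputClass.NO_INSTRUCTION


def flag_misaligned_only(label):
    # Three-class policy: only misaligned instructions are flagged.
    return label is InputClass.MISALIGNED


def false_positive_rate(labels, flag):
    # FPR = flagged benign inputs / all benign inputs,
    # where "benign" means anything that is not MISALIGNED.
    benign = [lab for lab in labels if lab is not InputClass.MISALIGNED]
    if not benign:
        return 0.0
    return sum(1 for lab in benign if flag(lab)) / len(benign)


# Hypothetical toy data: 4 non-instruction, 4 aligned, 2 misaligned inputs.
toy_labels = (
    [InputClass.NO_INSTRUCTION] * 4
    + [InputClass.ALIGNED] * 4
    + [InputClass.MISALIGNED] * 2
)

fpr_binary = false_positive_rate(toy_labels, flag_any_instruction)
fpr_three_class = false_positive_rate(toy_labels, flag_misaligned_only)
print(fpr_binary, fpr_three_class)
```

Under these assumptions, the "any instruction" policy flags every benign input with an aligned instruction, while the three-class policy flags none of them. This is the misclassification the abstract attributes to existing defenses.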

Related benchmarks

Task	Dataset	Result
Prompt injection detection	SCOUT-450	ASR (hid) 59.6	13
Prompt injection detection	Entertainment Direct Prompt Injection	FPR 0.00e+0	7
Prompt injection detection	Language Direct Prompt Injection	FPR 0.00e+0	7
Prompt injection detection	Media Direct Prompt Injection	FPR 0.00e+0	7
Prompt injection detection	AlignSentinel Evaluation Dataset (Indirect Prompt Injection Attack)	FPR (Coding) 0.00e+0	7
Prompt injection detection	Coding Direct Prompt Injection	FPR 0.00e+0	7
Prompt injection detection	Messaging Direct Prompt Injection	FPR 0.00e+0	7
Prompt injection detection	Shopping Direct Prompt Injection	FPR 0.00e+0	7
Prompt injection detection	Teaching Direct Prompt Injection	FPR 0.00e+0	7
Prompt injection detection	Web Direct Prompt Injection	FPR 0.00e+0	7

Showing 10 of 13 rows
