Our new X account is live! Follow @wizwand_team for updates
WorkDL logo mark

Palisade -- Prompt Injection Detection Framework

About

The advent of Large Language Models LLMs marks a milestone in Artificial Intelligence, altering how machines comprehend and generate human language. However, LLMs are vulnerable to malicious prompt injection attacks, where crafted inputs manipulate the models behavior in unintended ways, compromising system integrity and causing incorrect outcomes. Conventional detection methods rely on static, rule-based approaches, which often fail against sophisticated threats like abnormal token sequences and alias substitutions, leading to limited adaptability and higher rates of false positives and false negatives.This paper proposes a novel NLP based approach for prompt injection detection, emphasizing accuracy and optimization through a layered input screening process. In this framework, prompts are filtered through three distinct layers rule-based, ML classifier, and companion LLM before reaching the target model, thereby minimizing the risk of malicious interaction.Tests show the ML classifier achieves the highest accuracy among individual layers, yet the multi-layer framework enhances overall detection accuracy by reducing false negatives. Although this increases false positives, it minimizes the risk of overlooking genuine injected prompts, thus prioritizing security.This multi-layered detection approach highlights LLM vulnerabilities and provides a comprehensive framework for future research, promoting secure interactions between humans and AI systems.

Sahasra Kokkula, Somanathan R, Nandavardhan R, Aashishkumar, G Divya• 2024

Related benchmarks

TaskDatasetResultRank
Agent Task PerformanceAgentDojo Travel
Attack Success Rate0.00e+0
24
Agent Task PerformanceAgentDojo Banking
Attack Success Rate10.42
18
LLM Agent DefenseAgentDojo Workspace
Clean Utility60
12
LLM Agent DefenseAgentDojo Slack
Clean Utility28.57
12
LLM Agent DefenseAgentDojo Overall
Clean Utility41.24
12
Prompt Injection DefenseAgentDojo Banking suite v1 (test)
CU43.75
6
Prompt Injection DefenseAgentDojo Travel suite v1 (test)
CU20
6
Prompt Injection DefenseAgentDojo Workspace suite v1 (test)
CU0.225
6
Prompt Injection DefenseAgentDojo Slack suite v1 (test)
CU28.57
6
Prompt Injection DefenseAgentDojo Overall v1 (test)
CU26.8
6
Showing 10 of 10 rows

Other info

Follow for update