
RouteGuard: Internal-Signal Detection of Skill Poisoning in LLM Agents

About

Agent skills introduce a new and more severe form of indirect injection for LLM agents: unlike traditional indirect prompt injection, attackers can hide malicious instructions inside a dense, action-oriented skill that already functions as a legitimate instruction source. We study pre-execution skill-poison detection and show that successful skill poisoning induces a structured internal effect, attention hijacking, in which response-time attention shifts from trusted context to malicious skill spans and drives harmful behavior. Motivated by this mechanism, we propose RouteGuard, a frozen-backbone detector that combines response-conditioned attention and hidden-state alignment through reliability-gated late fusion. Across both real and synthetic open-source skill benchmarks, RouteGuard is consistently the strongest or most robust detector; on the critical Skill-Inject channel slice, it reaches 0.8834 F1 and recovers 90.51% of description attacks missed by lexical screening, showing that defending against skill poisoning requires internal-signal detection rather than text-only filtering.

Wenjie Xiao, Xuehai Tang, Biyu Zhou, Songlin Hu, Jizhong Han • 2026
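
The abstract names two ideas without giving formulas: measuring how much response-time attention lands on skill spans, and fusing an attention-based score with a hidden-state score according to how reliable each one is. The sketch below is only an illustration of those two ideas under our own assumptions, not the RouteGuard implementation. The function names `attention_share` and `gated_late_fusion`, the toy attention matrix, and the reliability weights are all hypothetical, and the fusion shown is a plain reliability-weighted average swapped in for whatever gating the paper actually uses.

```python
from typing import Sequence


def attention_share(attn: Sequence[Sequence[float]], span: range) -> float:
    """Fraction of total response-time attention mass that lands on `span`.

    attn[i][j] is the attention weight from response token i to context
    token j (hypothetical layout, e.g. averaged over heads and layers).
    """
    total = sum(sum(row) for row in attn)
    if total == 0:
        return 0.0
    on_span = sum(row[j] for row in attn for j in span)
    return on_span / total


def gated_late_fusion(score_attn: float, score_hidden: float,
                      rel_attn: float, rel_hidden: float) -> float:
    """Reliability-weighted average of two detector scores.

    Assumption: each detector has a non-negative reliability weight
    (for example, estimated on validation data). This is a stand-in for
    the paper's gating, which the abstract does not specify.
    """
    denom = rel_attn + rel_hidden
    if denom == 0:
        # No reliability information: fall back to a plain average.
        return 0.5 * (score_attn + score_hidden)
    return (rel_attn * score_attn + rel_hidden * score_hidden) / denom


# Toy example: 2 response tokens attending over 3 context tokens.
# Context tokens 0-1 are trusted context; token 2 is the skill span.
toy_attn = [
    [0.125, 0.125, 0.75],
    [0.25, 0.25, 0.5],
]
skill_share = attention_share(toy_attn, range(2, 3))
fused = gated_late_fusion(skill_share, 0.25, rel_attn=3.0, rel_hidden=1.0)
print(skill_share)  # 0.625
print(fused)        # 0.53125
```

In this toy case, 62.5% of the attention mass falls on the skill span, and the fused score leans toward the attention signal because it was given the larger reliability weight.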

Related benchmarks

Task | Dataset | Result | Rank
Skill Poisoning Detection | Skill-Inject | Precision 80.19 | 11
Skill Poisoning Detection | MASB | Precision 63.93 | 8
Skill Poisoning Detection | MASW | Precision 63.93 | 8
Malicious Instruction Detection | MaliciousAgentSkillsBench traditional IPI baselines | Precision 63.93 | 4
