RouteGuard: Internal-Signal Detection of Skill Poisoning in LLM Agents

About

Agent skills introduce a new and more severe form of indirect injection for LLM agents: unlike traditional indirect prompt injection, attackers can hide malicious instructions inside a dense, action-oriented skill that already functions as a legitimate instruction source. We study pre-execution skill-poison detection and show that successful skill poisoning induces a structured internal effect, attention hijacking, in which response-time attention shifts from trusted context to malicious skill spans and drives harmful behavior. Motivated by this mechanism, we propose RouteGuard, a frozen-backbone detector that combines response-conditioned attention and hidden-state alignment through reliability-gated late fusion. Across both real and synthetic open-source skill benchmarks, RouteGuard is consistently the strongest or most robust detector; on the critical Skill-Inject channel slice, it reaches 0.8834 F1 and recovers 90.51% of description attacks missed by lexical screening, showing that defending against skill poisoning requires internal-signal detection rather than text-only filtering.
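
The abstract names two ingredients, attention shifting toward skill spans and a reliability-gated late fusion of two signals, but does not give their formulas. The Python sketch below is one plausible reading, not the authors' implementation: the function names, the span-share score, and the reliability-weighted average are all assumptions made for illustration. It takes per-response-token attention rows over the context, measures how much more attention mass falls on the skill span than on the trusted span, and then blends that score with a second (hidden-state) score using per-detector reliability weights.

```python
from typing import Sequence, Tuple

Span = Tuple[int, int]  # half-open token range [start, end)


def span_attention_share(attn_row: Sequence[float], span: Span) -> float:
    """Fraction of one response token's attention mass that lands on `span`."""
    start, end = span
    total = sum(attn_row)
    if total <= 0:
        return 0.0
    return sum(attn_row[start:end]) / total


def hijack_score(attn: Sequence[Sequence[float]],
                 skill_span: Span, trusted_span: Span) -> float:
    """Mean over response tokens of (skill share - trusted share).

    Hypothetical score in [-1, 1]; positive values mean the response
    attends more to the skill span than to the trusted context.
    """
    if not attn:
        return 0.0
    diffs = [
        span_attention_share(row, skill_span)
        - span_attention_share(row, trusted_span)
        for row in attn
    ]
    return sum(diffs) / len(diffs)


def gated_late_fusion(attn_score: float, hidden_score: float,
                      attn_reliability: float,
                      hidden_reliability: float) -> float:
    """Blend two detector scores, weighting each by an assumed reliability.

    The reliabilities are hypothetical non-negative weights (for example,
    validation accuracy of each detector); this is not the paper's gate.
    """
    total = attn_reliability + hidden_reliability
    if total <= 0:
        # No reliability information: fall back to a plain average.
        return 0.5 * (attn_score + hidden_score)
    return (attn_reliability * attn_score
            + hidden_reliability * hidden_score) / total


# Toy input: 2 response tokens attending over 6 context tokens.
# Tokens 0-2 are trusted context, tokens 3-5 are the skill span.
toy_attn = [
    [0.10, 0.10, 0.10, 0.30, 0.20, 0.20],
    [0.05, 0.05, 0.10, 0.40, 0.20, 0.20],
]
score = hijack_score(toy_attn, skill_span=(3, 6), trusted_span=(0, 3))
fused = gated_late_fusion(score, hidden_score=0.2,
                          attn_reliability=0.9, hidden_reliability=0.3)
```

In a real setting the attention rows would come from a frozen model's attention maps for the generated response, and the hidden-state score from a separate alignment measure; both are stubbed here with toy numbers.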

Wenjie Xiao, Xuehai Tang, Biyu Zhou, Songlin Hu, Jizhong Han • 2026

Related benchmarks

Task	Dataset	Result
Skill Poisoning Detection	Skill-Inject	Precision 80.19	11
Skill Poisoning Detection	MASB	Precision 63.93	8
Skill Poisoning Detection	MASW	Precision 63.93	8
Malicious Instruction Detection	MaliciousAgentSkillsBench traditional IPI baselines	Precision 63.93	4

