Cordyceps: Covert Control Attacks on LLMs via Data Poisoning

About

Large language models (LLMs) are often fine-tuned on uncurated text datasets that adversaries can poison. Existing poisoning attacks primarily rely on fixed trigger phrases that defenses such as outlier detection, clean-data regularization, or online monitoring can neutralize. In this paper, we propose a data poisoning method that teaches an LLM an information hiding scheme reliably and stealthily through semantic associations between shared knowledge such as facts or concepts and attacker-chosen phrases. The induced hiding scheme can encode and decode arbitrary malicious instructions, thus revealing a new and subtle poisoning-induced vulnerability: covert control attacks. We precisely characterize covert control attacks and evaluate them across 5 LLMs, 3 backdoor defenses, and 4 prompt injection defenses. With a small poisoned fraction, covert control attacks outperform heuristic-based prompt injection attacks in average attack success rate by about 40% relative to clean fine-tuned models. They also circumvent defenses based on detection and fine-tuning, maintaining up to 93% attack success rate after backdoor defenses and up to 98% after prompt injection defenses.

Zedian Shao, Charles Fleming, Teodora Baluta • 2026

Related benchmarks

Task	Dataset	Result
Prompt Injection	OpenPromptInjection	ASVh: 73.6	40
Data Exfiltration	Data Exfiltration	Covert Accuracy (CA): 93	25
Poisoning Defense Evaluation	Target-Injected Tasks 7x7 UCC poisoned	--	10
Prompt injection detection	DSD	Detection Rate (TPR/FPR): 43.7	8
Prompt injection detection	GC	Detection Rate: 78	8
Prompt injection detection	HD	Detection Rate: 45.1	8
Prompt injection detection	NLI	Detection Rate (TPR/FPR): 69.1	8
Prompt injection detection	SA	Detection Rate (TPR/FPR): 33.1	8
Prompt injection detection	SD	Detection Rate (TPR/FPR): 33.1	8
Prompt injection detection	Summ.	Detection Rate (TPR/FPR): 87.7	8

Showing 10 of 24 rows
