Cordyceps: Covert Control Attacks on LLMs via Data Poisoning
About
Large language models (LLMs) are often fine-tuned on uncurated text datasets that adversaries can poison. Existing poisoning attacks rely primarily on fixed trigger phrases, which defenses such as outlier detection, clean-data regularization, or online monitoring can neutralize. In this paper, we propose a data poisoning method that reliably and stealthily teaches an LLM an information-hiding scheme by building semantic associations between shared knowledge (such as facts or concepts) and attacker-chosen phrases. The induced hiding scheme can encode and decode arbitrary malicious instructions, revealing a new and subtle poisoning-induced vulnerability: covert control attacks. We precisely characterize covert control attacks and evaluate them across $5$ LLMs, $3$ backdoor defenses, and $4$ prompt injection defenses. With a small poisoned fraction, covert control attacks outperform heuristic-based prompt injection attacks in average attack success rate by about $40\%$ relative to clean fine-tuned models. They also circumvent defenses based on detection and fine-tuning, maintaining up to $93\%$ attack success rate after backdoor defenses and up to $98\%$ after prompt injection defenses.
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Prompt Injection | OpenPromptInjection | ASVh | 73.6 | 40 |
| Data Exfiltration | Data Exfiltration | Covert Accuracy (CA) | 93 | 25 |
| Poisoning Defense Evaluation | Target-Injected Tasks 7x7 UCC poisoned | -- | -- | 10 |
| Prompt injection detection | DSD | Detection Rate (TPR/FPR) | 43.7 | 8 |
| Prompt injection detection | GC | Detection Rate | 78 | 8 |
| Prompt injection detection | HD | Detection Rate | 45.1 | 8 |
| Prompt injection detection | NLI | Detection Rate (TPR/FPR) | 69.1 | 8 |
| Prompt injection detection | SA | Detection Rate (TPR/FPR) | 33.1 | 8 |
| Prompt injection detection | SD | Detection Rate (TPR/FPR) | 33.1 | 8 |
| Prompt injection detection | Summ. | Detection Rate (TPR/FPR) | 87.7 | 8 |
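
The table reports attack success and detection rates but does not say how they are computed. As a minimal sketch, assuming the standard definitions (attack success rate as the fraction of attempts in which the injected task is carried out, TPR as the fraction of attack inputs flagged, FPR as the fraction of benign inputs flagged), the metrics could be computed as below. The function names and the boolean input format are hypothetical and are not taken from the paper or its evaluation code.

```python
from typing import Sequence, Tuple


def attack_success_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of attack attempts that succeeded (assumed definition)."""
    if not outcomes:
        raise ValueError("outcomes must be non-empty")
    return sum(outcomes) / len(outcomes)


def detection_rates(flagged: Sequence[bool],
                    is_attack: Sequence[bool]) -> Tuple[float, float]:
    """Return (TPR, FPR) for a binary detector.

    flagged[i]   -- True if the detector flagged input i
    is_attack[i] -- True if input i actually carried an injection
    """
    if len(flagged) != len(is_attack):
        raise ValueError("flagged and is_attack must have equal length")
    positives = sum(is_attack)
    negatives = len(is_attack) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one attack and one benign input")
    true_pos = sum(f and a for f, a in zip(flagged, is_attack))
    false_pos = sum(f and not a for f, a in zip(flagged, is_attack))
    return true_pos / positives, false_pos / negatives


if __name__ == "__main__":
    # Toy labels for illustration only; not data from the paper.
    print(attack_success_rate([True, True, False, True]))
    print(detection_rates([True, False, True, False],
                          [True, True, False, False]))
```

Reporting TPR and FPR together matters for a detection defense: a detector that flags every input reaches a TPR of 1.0 but is unusable because its FPR is also 1.0.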