
Cordyceps: Covert Control Attacks on LLMs via Data Poisoning

About

Large language models (LLMs) are often fine-tuned on uncurated text datasets that adversaries can poison. Existing poisoning attacks primarily rely on fixed trigger phrases that defenses such as outlier detection, clean-data regularization, or online monitoring can neutralize. In this paper, we propose a data poisoning method that teaches an LLM an information hiding scheme reliably and stealthily through semantic associations between shared knowledge such as facts or concepts and attacker-chosen phrases. The induced hiding scheme can encode and decode arbitrary malicious instructions, thus revealing a new and subtle poisoning-induced vulnerability: covert control attacks. We precisely characterize covert control attacks and evaluate them across $5$ LLMs, $3$ backdoor defenses, and $4$ prompt injection defenses. With a small poisoned fraction, covert control attacks outperform heuristic-based prompt injection attacks in average attack success rate by about $40\%$ relative to clean fine-tuned models. They also circumvent defenses based on detection and fine-tuning, maintaining up to $93\%$ attack success rate after backdoor defenses and up to $98\%$ after prompt injection defenses.

Zedian Shao, Charles Fleming, Teodora Baluta • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Prompt Injection | OpenPromptInjection | ASVh: 73.6 | 40 |
| Data Exfiltration | Data Exfiltration | Covert Accuracy (CA): 93 | 25 |
| Poisoning Defense Evaluation | Target-Injected Tasks 7x7 UCC poisoned | -- | 10 |
| Prompt injection detection | DSD | Detection Rate (TPR/FPR): 43.7 | 8 |
| Prompt injection detection | GC | Detection Rate: 78 | 8 |
| Prompt injection detection | HD | Detection Rate: 45.1 | 8 |
| Prompt injection detection | NLI | Detection Rate (TPR/FPR): 69.1 | 8 |
| Prompt injection detection | SA | Detection Rate (TPR/FPR): 33.1 | 8 |
| Prompt injection detection | SD | Detection Rate (TPR/FPR): 33.1 | 8 |
| Prompt injection detection | Summ. | Detection Rate (TPR/FPR): 87.7 | 8 |
Showing 10 of 24 rows
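
Several rows above report a detection rate labeled TPR/FPR. This page does not say how those figures were computed, so the following is only a minimal sketch of the standard definitions of true positive rate and false positive rate for a binary prompt-injection detector. The function name `detection_rates` and its arguments are hypothetical and not taken from the paper.

```python
from typing import Sequence, Tuple


def detection_rates(is_injected: Sequence[bool],
                    flagged: Sequence[bool]) -> Tuple[float, float]:
    """Return (TPR, FPR) for a binary detector.

    is_injected: ground truth, True if the input contains an injection.
    flagged:     detector output, True if the detector raised an alert.
    Both names are illustrative assumptions, not the paper's API.
    """
    if len(is_injected) != len(flagged):
        raise ValueError("label and prediction sequences must have equal length")

    pairs = list(zip(is_injected, flagged))
    tp = sum(1 for y, p in pairs if y and p)          # injected, flagged
    fn = sum(1 for y, p in pairs if y and not p)      # injected, missed
    fp = sum(1 for y, p in pairs if not y and p)      # clean, flagged
    tn = sum(1 for y, p in pairs if not y and not p)  # clean, passed

    # Guard against empty classes so the ratios are always defined.
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr


if __name__ == "__main__":
    labels = [True, True, True, False, False]
    alerts = [True, False, True, True, False]
    tpr, fpr = detection_rates(labels, alerts)
    print(f"TPR={tpr:.3f} FPR={fpr:.3f}")
```

Under these definitions, a lower TPR on injected inputs means more attacks pass the detector unnoticed, while a high FPR means clean inputs are being flagged.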
