Out of Sight, Not Out of Mind: Unveiling Latent Attack in Latent-based Multi-Agent Systems
About
Latent-based multi-agent systems replace parts of explicit inter-agent communication with hidden representations, offering a new direction for efficient and flexible agent collaboration. However, moving coordination into latent space may also move attacks beyond the reach of visible-text inspection. In this paper, we study whether latent states can carry attack-associated information that remains effective during clean executions. To examine this question, we introduce a latent attack framework that reactivates attack-induced effects through latent interventions without reusing adversarial text. Extensive experiments show that the resulting latent-only attacks can substantially degrade task performance in clean executions, especially when applied to inter-agent KV-cache handoffs rather than local hidden states. Further control analyses indicate that this degradation cannot be reduced to arbitrary perturbations or invalid generation. Overall, our findings suggest that latent-based collaboration does not remove attack risk. It shifts part of the risk into less observable execution states, calling for safeguards beyond visible-text inspection.
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Code Generation | HumanEval+ (test) | -- | -- | 132 |
| Multiple-choice Question Answering | OpenBookQA (test) | Accuracy | 88.8 | 61 |
| Mathematical Reasoning | GSM8K (test) | Accuracy | 92 | 23 |
| Direction-aware projection detection for edge-level KV interventions | GSM8K traces (held-out) | FPR | 0.00e+0 | 12 |
| Direction-agnostic layer-profile detection for edge-level KV interventions | GSM8K (held-out traces) | FPR | 4.4 | 9 |
| Direction-agnostic layer-profile detection for node-level hidden-state interventions | GSM8K (held-out) | FPR | 4 | 3 |