
SpectralGuard: Detecting Memory Collapse Attacks in State Space Models

About

State Space Models (SSMs) such as Mamba achieve linear-time sequence processing through input-dependent recurrence, but this mechanism introduces a critical safety vulnerability. We show that the spectral radius ρ(Ā) of the discretized transition operator governs the effective memory horizon: when an adversary drives ρ toward zero through gradient-based Hidden State Poisoning, memory collapses from millions of tokens to mere dozens, silently destroying reasoning capacity without triggering output-level alarms. We prove an Evasion Existence Theorem showing that, for any output-only defense, adversarial inputs exist that simultaneously induce spectral collapse and evade detection. We then introduce SpectralGuard, a real-time monitor that tracks spectral stability across all model layers. SpectralGuard achieves F1 = 0.961 against non-adaptive attackers and retains F1 = 0.842 under the strongest adaptive setting, with sub-15 ms per-token latency. Causal interventions and cross-architecture transfer to hybrid SSM-Attention systems confirm that spectral monitoring provides a principled, deployable safety layer for recurrent foundation models.
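To make the link between ρ(Ā) and memory horizon concrete, here is a minimal NumPy sketch. It assumes a Mamba-style diagonal state matrix with zero-order-hold discretization, Ā = exp(ΔA), and defines the memory horizon as the number of steps until ρ^t falls below a tolerance ε. The ε value, the Δ values, the 0.99 threshold, and all function names are hypothetical illustrations, not SpectralGuard's actual detector.

```python
import numpy as np


def discretize_diagonal(a_continuous: np.ndarray, delta: float) -> np.ndarray:
    """Zero-order-hold discretization of a diagonal state matrix (assumed Mamba-style).

    For diagonal A, A_bar = exp(delta * A) entry by entry. With negative
    real entries (a stable system) every entry lies in (0, 1).
    """
    return np.exp(delta * a_continuous)


def spectral_radius(a_bar_diag: np.ndarray) -> float:
    """rho(A_bar): for a diagonal matrix the eigenvalues are its diagonal entries."""
    return float(np.max(np.abs(a_bar_diag)))


def memory_horizon(rho: float, eps: float = 1e-3) -> float:
    """Steps t until rho**t <= eps, i.e. t = log(eps) / log(rho).

    eps is a hypothetical tolerance for when stored information counts as forgotten.
    """
    if rho <= 0.0:
        return 0.0
    if rho >= 1.0:
        return float("inf")
    return float(np.log(eps) / np.log(rho))


def flag_collapse(layer_rhos, threshold: float = 0.99) -> bool:
    """Hypothetical monitor rule: flag if any layer's rho drops below threshold."""
    return bool(min(layer_rhos) < threshold)


# Illustrative values only, not taken from the paper.
a_cont = -np.array([0.5, 1.0, 2.0])  # negative entries => stable recurrence

# A small step size keeps A_bar near 1; a large one pushes it toward 0.
rho_benign = spectral_radius(discretize_diagonal(a_cont, delta=1e-6))
rho_attacked = spectral_radius(discretize_diagonal(a_cont, delta=0.5))

print(f"benign:   rho={rho_benign:.7f}, horizon ~ {memory_horizon(rho_benign):,.0f} tokens")
print(f"attacked: rho={rho_attacked:.4f}, horizon ~ {memory_horizon(rho_attacked):,.0f} tokens")
print("flagged:", flag_collapse([rho_benign, rho_attacked]))
```

Under these assumptions, a larger step size Δ shrinks every entry of Ā toward zero, so the horizon falls from millions of steps to a few dozen. The single-threshold rule is only a stand-in for the per-layer spectral monitoring the abstract describes.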

Davi Bonetto • 2026

Related benchmarks

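| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Adversarial Attack Detection | Balanced 500-sample non-adaptive | Precision | 94 | 6 |
| Adversarial Detection | Wikipedia HiSPA word-shuffle benign adversarial, N=500 per model (held-out) | F1 Score | 65 | 3 |
| Adversarial Attack Detection | Adaptive Threshold Evasion | Precision | 94.1 | 2 |
| Adversarial Attack Detection | Adaptive Multi-layer-aware Split | Precision | 88.9 | 1 |
| Adversarial Attack Detection | Cross-architecture Transfer (Split) | F1 Score | 0.891 | 1 |

Results are listed as reported: most are percentages, while the cross-architecture F1 is on a 0 to 1 scale.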
