
Noise Injection Systemically Degrades Large Language Model Safety Guardrails

About

Safety guardrails in large language models (LLMs) are a critical component in preventing harmful outputs. Yet their resilience under perturbation remains poorly understood. In this paper, we investigate the robustness of safety fine-tuning in LLMs by systematically injecting Gaussian noise into model activations. We show across multiple open-weight models that (1) Gaussian noise raises harmful-output rates (p < 0.001) by up to 27%, (2) deeper safety fine-tuning affords no extra protection, and (3) chain-of-thought reasoning remains largely intact. The findings reveal critical vulnerabilities in current safety alignment techniques and highlight reasoning-based and reinforcement learning approaches as promising directions for developing more robust AI safety systems. These results have important implications for real-world deployment of LLMs in safety-critical applications, since they imply that widely deployed safety tuning methods can fail even without adversarial prompts.
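To make the perturbation concrete, here is a minimal sketch of adding zero-mean Gaussian noise to intermediate activations. It is a toy illustration, not the authors' implementation: the function names (`add_gaussian_noise`, `forward`), the toy layers, and the choice to perturb every layer's output with the same standard deviation `sigma` are all assumptions made for this example. With a real model, the same idea would typically be applied through the framework's hook mechanism (for example, forward hooks in PyTorch) rather than a hand-written loop.

```python
import random


def add_gaussian_noise(activations, sigma, rng):
    """Return a copy of `activations` with i.i.d. N(0, sigma^2) noise added to each entry."""
    return [a + rng.gauss(0.0, sigma) for a in activations]


def forward(x, layers, sigma=0.0, rng=None):
    """Run `x` through `layers` in order, perturbing each layer's output when sigma > 0.

    Hypothetical helper: perturbing every layer with one shared `sigma`
    is an assumption of this sketch, not a detail taken from the paper.
    """
    rng = rng or random.Random(0)
    h = x
    for layer in layers:
        h = layer(h)
        if sigma > 0.0:
            h = add_gaussian_noise(h, sigma, rng)
    return h


# Toy stand-ins for model layers (illustrative only).
def scale(h):
    return [2.0 * v for v in h]


def relu(h):
    return [max(0.0, v) for v in h]


x = [1.0, -2.0, 3.0]
clean = forward(x, [scale, relu])  # no noise: deterministic output
noisy = forward(x, [scale, relu], sigma=0.1, rng=random.Random(42))
```

Passing an explicit seeded `random.Random` keeps the perturbation reproducible, which matters when comparing harmful-output rates between clean and noisy runs on the same prompts.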

Prithviraj Singh Shahani, Kaveh Eskandari Miandoab, Matthias Scheutz • 2025

Related benchmarks

Task | Dataset | Result | Rank
--- | --- | --- | ---
Data Attribution | SVHN | Normalized Data Attribution Value: 0.0337 | 15
Data Attribution | FashionMNIST | Data Attribution Value: 0.0368 | 15
Data Attribution | CIFAR-10 | Data Attribution Value: 0.0496 | 15