SafeLM: Unified Privacy-Aware Optimization for Trustworthy Federated Large Language Models
About
Large language models (LLMs) are increasingly deployed in high-stakes domains, yet a unified treatment of their overlapping safety challenges remains lacking. We present SafeLM, a framework that jointly addresses four pillars of LLM safety: privacy, security, misinformation, and adversarial robustness. SafeLM combines federated training with gradient smartification and Paillier encryption for privacy, integrates defenses against training-time and inference-time attacks, employs contrastive grounding with calibrated decoding to reduce hallucinations, and introduces alignment-aware binarized aggregation to enhance robustness while maintaining bounded reconstruction quality. Across benchmarks on factuality, toxicity, and membership inference, SafeLM achieves 98.0% harmful content detection accuracy, reduces communication by 96.9%, and lowers gradient inversion PSNR from 31.7 dB to 15.1 dB. Ablations show that each component contributes independently, whereas their integration yields a strong privacy-utility-efficiency trade-off for deploying trustworthy LLMs.
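To make the binarized-aggregation and PSNR figures in the abstract more concrete, the following is a minimal sketch under stated assumptions, not SafeLM's actual method. It assumes each client sends one sign bit per coordinate plus a single scale, and that the server takes a per-coordinate majority vote; the "alignment-aware" weighting and the Paillier encryption step are not modelled. All function names (`binarize`, `aggregate`, `communication_reduction`, `psnr_db`) are hypothetical. The abstract does not say how the 96.9% communication reduction is derived; the sketch only shows that replacing 32-bit values with 1-bit signs gives a saving of the same size.

```python
import math


def binarize(update):
    """Compress an update to per-coordinate signs plus one float scale.

    Assumption: the scale is the mean absolute value of the update.
    """
    scale = sum(abs(x) for x in update) / len(update)
    signs = [1 if x >= 0 else -1 for x in update]
    return signs, scale


def aggregate(client_updates):
    """Majority-vote the signs per coordinate, rescaled by the mean client scale.

    Assumption: plain majority voting stands in for the paper's
    alignment-aware rule, which the abstract does not specify.
    """
    packed = [binarize(u) for u in client_updates]
    mean_scale = sum(scale for _, scale in packed) / len(packed)
    dim = len(client_updates[0])
    result = []
    for i in range(dim):
        vote = sum(signs[i] for signs, _ in packed)
        direction = 1 if vote > 0 else -1 if vote < 0 else 0
        result.append(direction * mean_scale)
    return result


def communication_reduction(bits_full=32, bits_compressed=1):
    """Fraction of payload saved when each value shrinks from bits_full to bits_compressed."""
    return 1.0 - bits_compressed / bits_full


def psnr_db(original, reconstruction, peak=1.0):
    """Peak signal-to-noise ratio in dB; lower means a worse reconstruction."""
    mse = sum((a - b) ** 2 for a, b in zip(original, reconstruction)) / len(original)
    return float("inf") if mse == 0 else 10 * math.log10(peak ** 2 / mse)


if __name__ == "__main__":
    clients = [[0.2, -0.4], [0.6, -0.2], [-0.1, -0.3]]
    print(aggregate(clients))
    print(f"{100 * communication_reduction():.1f}% fewer bits per value")
    print(f"{psnr_db([0.0, 1.0], [0.0, 0.5]):.2f} dB")
```

Because PSNR is logarithmic in the mean squared error, the reported drop from 31.7 dB to 15.1 dB (16.6 dB) corresponds to a reconstruction error roughly 46 times larger, assuming the same peak value in both measurements.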
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Multiple-choice Question Answering | TruthfulQA Multiple-choice | MC1 Score | 46.1 | 19 |
| Natural Language Understanding | AdvGLUE | -- | -- | 8 |
| Classification | Federated Learning Convergence IID α = ∞ (train) | R95 | 234 | 7 |
| Safety Classification | 7-class safety dataset (test) | Accuracy | 98 | 5 |
| Gradient Inversion Resistance | Safety dataset 7-class (train/test) | PSNR (dB) | 15.1 | 5 |
| Classification | Federated Learning Convergence Non-IID Dir. α = 0.1 (train) | R95 Convergence | 264 | 4 |
| Natural Language Inference | ANLI R3 | Clean Accuracy | 62.4 | 4 |
| Security Evaluation (Data Poisoning and Backdoor Attacks) | Unspecified Dataset (test) | Accuracy (5% Malicious) | 98.1 | 4 |
| Text Summarization | CNN/DM | Hal. Rate | 20.5 | 4 |