
SafeLM: Unified Privacy-Aware Optimization for Trustworthy Federated Large Language Models

About

Large language models (LLMs) are increasingly deployed in high-stakes domains, yet a unified treatment of their overlapping safety challenges remains lacking. We present SafeLM, a framework that jointly addresses four pillars of LLM safety: privacy, security, misinformation, and adversarial robustness. SafeLM combines federated training with gradient smartification and Paillier encryption for privacy, integrates defenses against training-time and inference-time attacks, employs contrastive grounding with calibrated decoding to reduce hallucinations, and introduces alignment-aware binarized aggregation to enhance robustness while maintaining bounded reconstruction quality. Across benchmarks on factuality, toxicity, and membership inference, SafeLM achieves 98.0% harmful content detection accuracy, reduces communication by 96.9%, and lowers gradient inversion PSNR from 31.7 dB to 15.1 dB. Ablations show that each component contributes independently, while their integration yields a strong privacy-utility-efficiency trade-off for deploying trustworthy LLMs.

Noor Islam S. Mohammad, Uluğ Bayazıt • 2026
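
The abstract quotes a communication reduction and a gradient inversion PSNR without showing how such figures are computed. The following is a minimal sketch, not the authors' implementation: it assumes plain sign binarization with coordinate-wise majority voting as a stand-in for the paper's alignment-aware binarized aggregation, and all function names are hypothetical. Sending 1 bit instead of a 32-bit float per gradient entry gives a saving of 1 - 1/32, about 96.9%, which is consistent with the reported figure, although the abstract does not state that this is how it was derived. The PSNR function uses the standard definition, 10 log10(MAX^2 / MSE).

```python
import math


def binarize(grad):
    """Keep only the sign of each gradient entry (1 bit per value).

    Assumption: zero is mapped to +1 so every entry fits in one bit.
    """
    return [1 if g >= 0 else -1 for g in grad]


def majority_vote(sign_updates):
    """Aggregate client sign vectors coordinate-wise by majority vote.

    Ties (possible with an even number of clients) resolve to +1.
    """
    aggregated = []
    for coords in zip(*sign_updates):
        aggregated.append(1 if sum(coords) >= 0 else -1)
    return aggregated


def communication_reduction(bits_full=32, bits_compressed=1):
    """Fraction of upload bits saved per gradient entry."""
    return 1.0 - bits_compressed / bits_full


def psnr(original, reconstructed, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length signals."""
    mse = sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)
    if mse == 0:
        return math.inf  # identical signals: perfect reconstruction
    return 10.0 * math.log10(max_value ** 2 / mse)


if __name__ == "__main__":
    # Three hypothetical clients, each holding a 3-entry gradient.
    clients = [[0.3, -0.2, 0.1], [0.5, 0.4, -0.7], [-0.1, -0.9, -0.2]]
    signs = [binarize(g) for g in clients]
    print(majority_vote(signs))
    print(f"{100 * communication_reduction():.1f}% fewer bits per entry")
```

For an attacker reconstructing training inputs from gradients, a lower PSNR means a noisier reconstruction, so the reported drop from 31.7 dB to 15.1 dB corresponds to substantially degraded recovered inputs.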

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Multiple-choice Question Answering | TruthfulQA Multiple-choice | MC1 Score | 46.1 | 19 |
| Natural Language Understanding | AdvGLUE | - | - | 8 |
| Classification | Federated Learning Convergence IID α = ∞ (train) | R95 | 234 | 7 |
| Safety Classification | 7-class safety dataset (test) | Accuracy | 98 | 5 |
| Gradient Inversion Resistance | Safety dataset 7-class (train/test) | PSNR (dB) | 15.1 | 5 |
| Classification | Federated Learning Convergence Non-IID Dir. α = 0.1 (train) | R95 Convergence | 264 | 4 |
| Natural Language Inference | ANLI R3 | Clean Accuracy | 62.4 | 4 |
| Security Evaluation (Data Poisoning and Backdoor Attacks) | Unspecified Dataset (test) | Accuracy (5% Malicious) | 98.1 | 4 |
| Text Summarization | CNN/DM | Hal. Rate | 20.5 | 4 |