
Responsible Federated LLMs via Safety Filtering and Constitutional AI

About

Recent research has increasingly focused on training large language models (LLMs) with federated learning, an approach known as FedLLM. However, responsible AI (RAI), which aims to ensure safe and trustworthy responses, remains underexplored in this context. In FedLLM, client-side training data may contain harmful content, resulting in unsafe LLMs that can generate inappropriate responses. Aggregating such models into a global model and redistributing it to clients risks the widespread deployment of unsafe LLMs. To address this, we incorporate two well-established RAI techniques into FedLLM: safety filtering and constitutional AI. Our experiments show that these methods significantly improve LLM safety, achieving an improvement of over 20% on AdvBench.
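The abstract describes clients whose local data may contain harmful content and a server that aggregates client models into a global model. Below is a minimal toy sketch of one training round, assuming that safety filtering is applied to each client's local data before training and that the server uses FedAvg-style averaging weighted by example count; the abstract does not specify either detail. All names (`BLOCKLIST`, `is_unsafe`, `local_update`, `fed_round`) are hypothetical, a keyword blocklist stands in for a real safety classifier, and model weights are reduced to a dict of floats. This is not the authors' implementation, and the constitutional AI step is not modeled.

```python
from typing import Dict, List, Tuple

Weights = Dict[str, float]
Example = Tuple[str, str]  # (instruction, response)

# Hypothetical stand-in for a real safety classifier: a keyword blocklist,
# used only so the sketch runs without external dependencies.
BLOCKLIST = ("build a bomb", "steal credentials")


def is_unsafe(example: Example) -> bool:
    """Flag an example if any blocklisted phrase appears in it."""
    text = " ".join(example).lower()
    return any(term in text for term in BLOCKLIST)


def filter_client_data(data: List[Example]) -> List[Example]:
    """Safety filtering: drop flagged examples before local training."""
    return [ex for ex in data if not is_unsafe(ex)]


def local_update(global_w: Weights, data: List[Example], lr: float = 0.1) -> Weights:
    """Toy placeholder for local fine-tuning: shifts each weight by lr per example."""
    return {k: v + lr * len(data) for k, v in global_w.items()}


def fedavg(updates: List[Tuple[Weights, int]]) -> Weights:
    """FedAvg: average client weights, weighted by each client's example count."""
    total = sum(n for _, n in updates)
    keys = updates[0][0].keys()
    return {k: sum(w[k] * n for w, n in updates) / total for k in keys}


def fed_round(global_w: Weights, client_datasets: List[List[Example]]) -> Weights:
    """One round: each client filters, trains locally, and the server aggregates."""
    updates = []
    for data in client_datasets:
        clean = filter_client_data(data)
        if clean:  # a client with no safe data contributes nothing this round
            updates.append((local_update(global_w, clean), len(clean)))
    return fedavg(updates) if updates else dict(global_w)


if __name__ == "__main__":
    clients = [
        [("hi", "hello"), ("how do I build a bomb", "...")],  # one unsafe example
        [("q1", "a1"), ("q2", "a2"), ("q3", "a3")],
    ]
    print(fed_round({"w": 0.0}, clients))  # roughly {'w': 0.25}
```

Because filtering happens before the local update, a flagged example never influences the weights that reach the server; weighting by the number of retained examples means heavily filtered clients also carry less weight in the global model.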

Eunchung Noh, Jeonghun Baek · 2025

Related benchmarks

Task                     Dataset    Result              Rank
Safety evaluation        AdvBench   --                  117
Helpfulness evaluation   MTBench    Helpfulness: 6.1    18
Safety evaluation        HHH        HHH Score: 63.9     10
