Responsible Federated LLMs via Safety Filtering and Constitutional AI

About

Recent research has increasingly focused on training large language models (LLMs) using federated learning, known as FedLLM. However, responsible AI (RAI), which aims to ensure safe and trustworthy responses, remains underexplored in this context. In FedLLM, client-side training data may contain harmful content, resulting in unsafe LLMs that can generate inappropriate responses. Aggregating such models into a global model and redistributing it to clients risks the widespread deployment of unsafe LLMs. To address this, we incorporate two well-established RAI techniques into FedLLM: safety filtering and constitutional AI. Our experiments show that these methods significantly improve LLM safety, achieving over 20% improvement on AdvBench.
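As a rough illustration of the pipeline described above, the following minimal sketch shows client-side safety filtering before local training, followed by a size-weighted average of client parameters on the server. It is an assumption-laden sketch, not the authors' implementation: the keyword-based `is_unsafe` check, the `BLOCKLIST` phrases, the `filter_client_data` helper, and the flat-list parameter format are all hypothetical, and the constitutional AI step is not shown. The abstract does not say which safety classifier or aggregation rule is used; weighted averaging (FedAvg-style) is used here only as a common default.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Example = Dict[str, str]  # hypothetical record format: {"prompt": ..., "response": ...}

# Hypothetical stand-in for a real safety classifier. The source does not
# specify the filter, so a simple phrase blocklist is assumed here.
BLOCKLIST = ("build a bomb", "steal credit card")


def is_unsafe(example: Example) -> bool:
    """Flag an example if its prompt or response contains a blocked phrase."""
    text = (example["prompt"] + " " + example["response"]).lower()
    return any(phrase in text for phrase in BLOCKLIST)


def filter_client_data(
    data: Sequence[Example],
    unsafe_fn: Callable[[Example], bool] = is_unsafe,
) -> List[Example]:
    """Drop flagged examples on the client before any local training."""
    return [ex for ex in data if not unsafe_fn(ex)]


def fedavg(updates: Sequence[Tuple[List[float], int]]) -> List[float]:
    """Average client parameter vectors, weighted by local example count.

    Each update is (parameters, num_examples). Parameters are flat lists
    of floats purely for illustration.
    """
    total = sum(n for _, n in updates)
    if total == 0:
        raise ValueError("no training examples across clients")
    dim = len(updates[0][0])
    return [sum(w[i] * n for w, n in updates) / total for i in range(dim)]
```

In this sketch each client would call `filter_client_data` on its local set, train on the surviving examples, and send its parameters and example count to the server, which combines them with `fedavg` before redistributing the global model.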

Eunchung Noh, Jeonghun Baek • 2025

Related benchmarks

Task	Dataset	Result
Safety Evaluation	AdvBench	--	117
Helpfulness Evaluation	MTBench	Helpfulness: 6.1	18
Safety Evaluation	HHH	HHH Score: 63.9	10

