Decoupled Training with Local Reinforcement Fine-Tuning in Federated Learning

About

Federated Learning (FL) with pre-trained Vision-Language Models (VLMs) has emerged as a promising paradigm for various downstream tasks. By leveraging the strong representations of VLMs, recent studies improve task adaptation under insufficient local data while preserving generalization. However, these methods emphasize fully local optimization with simple parameter aggregation, which can amplify inter-client optimization inconsistency and intra-client over-specialization under heterogeneous and full-data FL settings, making it difficult to balance global task adaptation and generalization. To address these challenges, we propose FedDTL, a novel federated VLM framework that decouples the image encoder and text encoder across clients and the server. Through decoupled encoder training with server-client modality alignment, FedDTL promotes coherent global semantic updates and reduces inter-client optimization inconsistency, improving global task adaptation. To further mitigate intra-client over-specialization, we introduce a two-stage local fine-tuning scheme in which a supervised fine-tuning stage provides a rapid and reliable warm start, followed by a reinforcement learning stage that enhances generalization. Extensive experiments on multiple benchmarks, covering label skew and feature shift, demonstrate that FedDTL achieves an effective balance between global task adaptation and generalization under various FL data distributions in both few-shot and full-data regimes.
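For readers unfamiliar with the baseline the abstract contrasts against, "simple parameter aggregation" commonly means a FedAvg-style weighted average of client parameters. The abstract does not state which aggregation rule the prior methods or FedDTL use, so the sketch below is an assumption for illustration only, not the paper's method. The names `weighted_average`, `client_params`, and `client_sizes` are hypothetical.

```python
from typing import Dict, List

# Hypothetical representation: each client's parameters as named flat vectors.
Params = Dict[str, List[float]]


def weighted_average(client_params: List[Params], client_sizes: List[int]) -> Params:
    """FedAvg-style aggregation (illustrative assumption, not FedDTL itself).

    Each client's parameters are weighted by its local sample count.
    """
    if not client_params or len(client_params) != len(client_sizes):
        raise ValueError("need one sample count per client")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total sample count must be positive")

    averaged: Params = {}
    for name in client_params[0]:
        length = len(client_params[0][name])
        averaged[name] = [
            sum(p[name][i] * n for p, n in zip(client_params, client_sizes)) / total
            for i in range(length)
        ]
    return averaged


# Usage: two clients with 1 and 3 local samples respectively.
result = weighted_average(
    [{"w": [0.0, 4.0]}, {"w": [4.0, 0.0]}],
    [1, 3],
)
```

Because each client optimizes fully locally before averaging, clients with very different data can pull the average in conflicting directions, which is the inter-client inconsistency the abstract refers to.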

Yuting Ma, Lechao Cheng, Xiaohua Xu • 2026

Related benchmarks

Task	Dataset	Metric	Result	
Image Classification	DomainNet (test)	Average Accuracy	93.47	272
Federated Few-shot Image Classification	CIFAR10, CIFAR100, EuroSAT, Tiny-ImageNet, OxfordPet, Flower102, Caltech101, Caltech256, Food101 (Local classes)	Accuracy	92.58	69
Image Classification	Office-Caltech-10 (test)	Average Accuracy	98.65	58
Image Classification	Aggregate of 9 benchmarks (CIFAR10, CIFAR100, EuroSAT, OxfordPet, Flowers102, Food101, SUN397, DTD, Caltech101), Few-shot	Local Top-1 Accuracy	92.58	35
Image Classification	Aggregate of 9 benchmarks (CIFAR10, CIFAR100, EuroSAT, OxfordPet, Flowers102, Food101, SUN397, DTD, Caltech101), Full-data	Average Local Top-1 Accuracy	94.07	35
Federated Few-shot Image Classification	CIFAR10, CIFAR100, EuroSAT, Tiny-ImageNet, OxfordPet, Flower102, Caltech101, Caltech256, Food101 (Base classes)	Accuracy	92.58	18
