
FedMVP: Federated Multimodal Visual Prompt Tuning for Vision-Language Models

About

In federated learning, textual prompt tuning adapts Vision-Language Models (e.g., CLIP) by tuning lightweight input tokens (prompts) on local client data while keeping the network weights frozen. After training, clients share only the prompts with the central server for aggregation. However, textual prompt tuning tends to overfit to known concepts, which limits its generalizability to unseen concepts. To address this limitation, we propose Federated Multimodal Visual Prompt Tuning (FedMVP), which conditions the prompts on multimodal contextual information derived from the input image and the textual attribute features of a class. At the core of FedMVP is a PromptFormer module that aligns textual and visual features through a cross-attention mechanism. The dynamically generated multimodal visual prompts are then fed to the frozen vision encoder of CLIP and trained with a combination of a CLIP similarity loss and a consistency loss. Extensive evaluation on 20 datasets, spanning three generalization settings, demonstrates that FedMVP not only preserves performance on in-distribution classes and domains but also generalizes better to unseen classes and domains, surpassing state-of-the-art methods by a margin of +1.57% to +2.26%. Code is available at https://github.com/mainaksingha01/FedMVP.

Mainak Singha, Subhankar Roy, Sarthak Mehrotra, Ankit Jha, Moloud Abdar, Biplab Banerjee, Elisa Ricci • 2025
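
The server-side aggregation mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state which aggregation rule is used, so this example assumes a FedAvg-style average weighted by local dataset size; the function name `aggregate_prompts` and the flat-list representation of the trainable parameters are hypothetical and not taken from the FedMVP code base.

```python
from typing import List, Sequence


def aggregate_prompts(
    client_prompts: Sequence[Sequence[float]],
    client_sizes: Sequence[int],
) -> List[float]:
    """Average per-client prompt parameters, weighted by local dataset size.

    Assumption: FedAvg-style weighting. The abstract only says that the
    lightweight parameters are sent to the server "for aggregation".
    """
    if not client_prompts or len(client_prompts) != len(client_sizes):
        raise ValueError("need one dataset size per client")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total number of samples must be positive")
    dim = len(client_prompts[0])
    if any(len(p) != dim for p in client_prompts):
        raise ValueError("all clients must send vectors of the same length")

    aggregated = [0.0] * dim
    for prompt, size in zip(client_prompts, client_sizes):
        weight = size / total  # a client's share of the total training data
        for i, value in enumerate(prompt):
            aggregated[i] += weight * value
    return aggregated


# Two clients with 1 and 3 local samples; the frozen CLIP weights never
# leave the clients, only these small trainable vectors do.
global_prompt = aggregate_prompts([[1.0, 2.0], [3.0, 4.0]], [1, 3])
```

Because only the small trainable vectors are exchanged, the communication cost per round scales with the prompt size rather than with the size of the frozen backbone.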

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Multi-Label Classification | NUS-WIDE (test) | mAP | 52.3 | 124 |
| Multi-Label Classification | VOC 07 | mAP | 85.27 | 73 |
| Multi-label recognition | PASCAL VOC 2007 (test) | Avg. mAP | 85.59 | 44 |
| Multi-Label Classification | NUS-WIDE | mAP | 52.73 | 36 |
| Multi-label image recognition | COCO 2014 | mAP | 6.53 | 15 |
| Multi-label recognition | COCO 2014 (test) | mAP | 61.64 | 12 |
| Generalized Zero-Shot Learning | COCO 2014 | mAP | 43.95 | 11 |
| Generalized Zero-Shot Learning | NUS-WIDE | mAP | 48.79 | 11 |
| Multi-label recognition | Multi-Scene | mAP | 49.56 | 10 |
| Multi-label recognition | MLRSNet | mAP | 45.89 | 10 |
