Federated Cross-Modal Retrieval with Missing Modalities via Semantic Routing and Adapter Personalization
About
Federated cross-modal retrieval faces severe challenges from heterogeneous client data, particularly non-IID semantic distributions and missing modalities. Under such heterogeneity, a single global model is often insufficient to capture both shared cross-modal knowledge and client-specific characteristics. We propose RCSR, a personalization-friendly federated framework that integrates prototype anchoring, retrieval-centric semantic routing, and optional client-specific adapters. Built on a frozen CLIP backbone, RCSR leverages lightweight shared adapters for global knowledge transfer while supporting efficient local personalization. Prototype anchoring helps unimodal clients align with global cross-modal semantics, and a server-side semantic router adaptively assigns aggregation weights based on retrieval consistency to mitigate alignment drift during heterogeneous updates. Extensive experiments on MS-COCO, Flickr30K, and other benchmarks show that RCSR consistently improves global retrieval accuracy and training stability, while further enhancing client-level retrieval performance, especially for clients with incomplete modalities. Code is available at https://github.com/RezinChow/RCSR-Retrieval-Centric-Semantic-Routing.
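The server-side routing step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each client reports a scalar retrieval-consistency score, that the router turns those scores into aggregation weights with a temperature-scaled softmax, and that adapter parameters are flat lists of floats. The names `routing_weights`, `aggregate_adapters`, and `temperature` are hypothetical and chosen only for illustration.

```python
import math


def routing_weights(consistency_scores, temperature=1.0):
    """Map per-client consistency scores to weights that sum to 1.

    Assumption: a softmax is used here for illustration; the abstract
    does not specify the router's weighting function.
    """
    top = max(consistency_scores)  # subtract the max for numerical stability
    exps = [math.exp((s - top) / temperature) for s in consistency_scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_adapters(client_adapters, weights):
    """Weighted average of flat adapter parameter vectors."""
    dim = len(client_adapters[0])
    return [
        sum(w * params[i] for w, params in zip(weights, client_adapters))
        for i in range(dim)
    ]


# Toy round with three clients (hypothetical scores and parameters).
scores = [0.9, 0.5, 0.1]
adapters = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = routing_weights(scores, temperature=0.5)
global_adapter = aggregate_adapters(adapters, weights)
```

In this sketch, clients whose updates are more consistent with retrieval receive larger weights, so a drifting client contributes less to the shared adapter. A lower `temperature` sharpens that preference, while a very high one approaches uniform averaging.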
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Text-to-Image Retrieval | Flickr30K (test) | R@1 | 50.18 | 525 |
| Image-to-Text Retrieval | Flickr30K (test) | R@1 | 48.62 | 472 |
| Text-to-Video Retrieval | MSR-VTT (test) | R@1 | 30.42 | 265 |
| Image-to-Text Retrieval | MS-COCO (test) | R@1 | 39.12 | 127 |
| Text-to-Image Retrieval | MS-COCO (test) | R@1 | 40.38 | 82 |
| Image-to-Text Retrieval | IAPR TC-12 (test) | R@1 | 35.06 | 10 |
| Text-to-Image Retrieval | IAPR TC-12 (test) | R@1 | 38.74 | 10 |
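
For readers unfamiliar with the metric in the table, the sketch below shows one common way to compute Recall@K from a query-by-candidate similarity matrix. It assumes the result is reported as a percentage and that each query has exactly one correct candidate, which simplifies image-to-text settings where an image may have several matching captions. The function name `recall_at_k` and the toy data are hypothetical.

```python
def recall_at_k(similarity, ground_truth, k=1):
    """Percentage of queries whose correct candidate ranks in the top k.

    similarity[q][c] is the score between query q and candidate c;
    ground_truth[q] is the index of the single correct candidate for q.
    """
    hits = 0
    for q, row in enumerate(similarity):
        # Candidate indices sorted by descending similarity.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if ground_truth[q] in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy example: 3 queries, 3 candidates (hypothetical scores).
sim = [
    [0.9, 0.1, 0.0],  # correct candidate 0 is ranked first
    [0.2, 0.3, 0.8],  # correct candidate 1 is ranked second
    [0.1, 0.7, 0.2],  # correct candidate 2 is ranked second
]
gt = [0, 1, 2]
r1 = recall_at_k(sim, gt, k=1)
r2 = recall_at_k(sim, gt, k=2)
```

Here only the first query retrieves its match at rank 1, so R@1 is one third of the queries, while all three matches appear within the top 2.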