Federated Cross-Modal Retrieval with Missing Modalities via Semantic Routing and Adapter Personalization

About

Federated cross-modal retrieval faces severe challenges from heterogeneous client data, particularly non-IID semantic distributions and missing modalities. Under such heterogeneity, a single global model is often insufficient to capture both shared cross-modal knowledge and client-specific characteristics. We propose RCSR, a personalization-friendly federated framework that integrates prototype anchoring, retrieval-centric semantic routing, and optional client-specific adapters. Built on a frozen CLIP backbone, RCSR leverages lightweight shared adapters for global knowledge transfer while supporting efficient local personalization. Prototype anchoring helps unimodal clients align with global cross-modal semantics, and a server-side semantic router adaptively assigns aggregation weights based on retrieval consistency to mitigate alignment drift during heterogeneous updates. Extensive experiments on MS-COCO, Flickr30K, and other benchmarks show that RCSR consistently improves global retrieval accuracy and training stability, while further enhancing client-level retrieval performance, especially for clients with incomplete modalities. Code is available at https://github.com/RezinChow/RCSR-Retrieval-Centric-Semantic-Routing.
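The server-side routing step described above can be illustrated with a minimal sketch. The abstract does not specify how retrieval consistency is turned into aggregation weights, so everything here is an assumption made for illustration: the softmax with a temperature, the function names `routing_weights` and `aggregate_adapters`, the flat-list adapter representation, and the example scores are all hypothetical and are not taken from the RCSR code.

```python
import math


def routing_weights(consistency_scores, temperature=1.0):
    """Map per-client retrieval-consistency scores to aggregation weights.

    Assumption: a temperature-scaled softmax. The paper may use a
    different mapping; this is only one plausible choice.
    """
    scaled = [s / temperature for s in consistency_scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_adapters(client_adapters, weights):
    """Weighted average of client adapter parameters.

    Each adapter is represented as a flat list of floats (hypothetical
    simplification of the shared adapter weights).
    """
    dim = len(client_adapters[0])
    return [
        sum(w * params[i] for w, params in zip(weights, client_adapters))
        for i in range(dim)
    ]


# Three hypothetical clients: higher score = more consistent retrieval.
scores = [0.9, 0.5, 0.1]
adapters = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

weights = routing_weights(scores, temperature=0.5)
global_adapter = aggregate_adapters(adapters, weights)
```

Under this sketch, clients whose updates score higher on retrieval consistency contribute more to the aggregated shared adapter, which is the general behavior the abstract attributes to the semantic router. With equal scores the scheme reduces to a plain unweighted average.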

Hefeng Zhou, Xuan Liu, Sicheng Chen, Wutong Zhang, Wu Yan, Jiong Lou, Chentao Wu, Guangtao Xue, Wei Zhao, Jie Li • 2026

Related benchmarks

Task                     Dataset            Metric    Result  Rank
Text-to-Image Retrieval  Flickr30k (test)   Recall@1  50.18   525
Image-to-Text Retrieval  Flickr30k (test)   R@1       48.62   472
Text-to-Video Retrieval  MSR-VTT (test)     R@1       30.42   265
Image-to-Text Retrieval  MS-COCO (test)     R@1       39.12   127
Text-to-Image Retrieval  MS-COCO (test)     R@1       40.38   82
Image-to-Text Retrieval  IAPR TC-12 (test)  R@1       35.06   10
Text-to-Image Retrieval  IAPR TC-12 (test)  R@1       38.74   10
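
For readers unfamiliar with the metric reported above, the following is a minimal sketch of how Recall@K is commonly computed for retrieval: the fraction of queries for which at least one correct item appears among the top K candidates. The function name `recall_at_k`, the nested-list similarity matrix, and the toy values are illustrative assumptions; this is not the evaluation code behind the listed results.

```python
def recall_at_k(similarity, ground_truth, k=1):
    """Fraction of queries with a correct candidate in the top k.

    similarity[q][c] is the score of candidate c for query q.
    ground_truth[q] is the set of correct candidate indices for query q.
    """
    hits = 0
    for q, scores in enumerate(similarity):
        # Rank candidate indices from highest to lowest score.
        ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
        if any(c in ground_truth[q] for c in ranked[:k]):
            hits += 1
    return hits / len(similarity)


# Toy example: 3 text queries scored against 3 candidate images.
sim = [
    [0.9, 0.1, 0.3],  # query 0: candidate 0 ranks first (correct)
    [0.2, 0.4, 0.8],  # query 1: candidate 2 ranks first (incorrect)
    [0.5, 0.6, 0.1],  # query 2: candidate 1 ranks first (correct)
]
gt = [{0}, {1}, {1}]

r1 = recall_at_k(sim, gt, k=1)  # 2 of 3 queries hit at rank 1
r2 = recall_at_k(sim, gt, k=2)  # all 3 queries hit within the top 2
```

Using a set for each query's ground truth accommodates datasets where a query has several correct matches, such as images paired with multiple captions.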