Noisy Correspondence Learning with Meta Similarity Correction
About
Despite the success of multimodal learning in cross-modal retrieval tasks, this remarkable progress relies on correct correspondence among multimedia data. However, collecting such ideal data is expensive and time-consuming. In practice, most widely used datasets are harvested from the Internet and inevitably contain mismatched pairs. Training on such noisy-correspondence datasets causes performance degradation because cross-modal retrieval methods can wrongly enforce the mismatched data to be similar. To tackle this problem, we propose a Meta Similarity Correction Network (MSCN) to provide reliable similarity scores. We view a binary classification task as the meta-process that encourages the MSCN to learn discrimination from positive and negative meta-data. To further alleviate the influence of noise, we design an effective data purification strategy that uses the meta-data as prior knowledge to remove noisy samples. Extensive experiments demonstrate the strengths of our method under both synthetic and real-world noise on Flickr30K, MS-COCO, and Conceptual Captions.
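To make the purification idea concrete, here is a minimal NumPy sketch, not the paper's implementation: it assumes we already have similarity scores for the noisy training pairs and for a small clean meta-set of known positive and negative pairs, and keeps only the pairs whose score falls on the positive side of a threshold derived from the meta-data. The function name `purify_pairs` and the midpoint threshold are illustrative assumptions.

```python
import numpy as np

def purify_pairs(noisy_sims, meta_pos_sims, meta_neg_sims):
    """Keep noisy training pairs whose similarity looks 'matched'.

    Hypothetical illustration of meta-data-guided purification:
    the threshold is the midpoint between the mean similarity of
    the positive meta-pairs and that of the negative meta-pairs.
    """
    threshold = 0.5 * (meta_pos_sims.mean() + meta_neg_sims.mean())
    return noisy_sims >= threshold

# Toy example: matched pairs tend to score high, mismatched ones low.
noisy = np.array([0.9, 0.2, 0.8, 0.1])      # scores of noisy training pairs
pos_meta = np.array([0.85, 0.95])           # clean matched meta-pairs
neg_meta = np.array([0.05, 0.15])           # clean mismatched meta-pairs
mask = purify_pairs(noisy, pos_meta, neg_meta)
# mask marks which noisy pairs survive purification
```

In the actual method the scores come from the learned correction network rather than a fixed midpoint rule; this sketch only shows how clean meta-data can act as prior knowledge for filtering.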
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Text-to-Image Retrieval | Flickr30K | R@1 | 59.6 | 460 |
| Text-to-Image Retrieval | Flickr30k (test) | R@1 | 59.6 | 423 |
| Image-to-Text Retrieval | Flickr30K | R@1 | 77.4 | 379 |
| Image-to-Text Retrieval | Flickr30k (test) | R@1 | 77.4 | 370 |
| Image-to-Text Retrieval | MS-COCO 1K (test) | R@1 | 78.1 | 121 |
| Text-to-Image Retrieval | MS-COCO | R@5 | 90.4 | 79 |
| Image-to-Text Retrieval | MS-COCO | R@5 | 97.2 | 65 |
| Text-to-Image Retrieval | MS-COCO 1K (test) | R@1 | 64.3 | 53 |
| Image-to-Text Retrieval | CC152K | R@1 | 40.1 | 48 |
| Text-to-Image Retrieval | CC152K | R@1 | 40.6 | 48 |