
Cross-Modal Discrete Representation Learning

About

Recent advances in representation learning have demonstrated an ability to represent information from different modalities, such as video, text, and audio, in a single high-level embedding vector. In this work we present a self-supervised learning framework that learns a representation capturing finer levels of granularity across modalities, such as concepts or events represented by visual objects or spoken words. Our framework relies on a discretized embedding space, created via vector quantization, that is shared across modalities. Beyond the shared embedding space, we propose a Cross-Modal Code Matching objective that forces the representations from different views (modalities) to have a similar distribution over the discrete embedding space, so that cross-modal object/action localization can be performed without direct supervision. In our experiments we show that the proposed discretized multi-modal fine-grained representation (e.g., pixel/word/frame) can complement high-level summary representations (e.g., video/sentence/waveform) for improved performance on cross-modal retrieval tasks. We also observe that the discretized representation uses individual clusters to represent the same semantic concept across modalities.

Alexander H. Liu, SouYoung Jin, Cheng-I Jeff Lai, Andrew Rouditchenko, Aude Oliva, James Glass • 2021
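
The two ideas in the abstract, a codebook shared across modalities and an objective that pulls the code distributions of paired inputs together, can be illustrated with a small sketch. This is not the authors' implementation. The function names, the softmax-over-negative-distance assignment, the temperature parameter, and the use of symmetric KL divergence as the matching score are all assumptions chosen for illustration; the abstract does not specify the exact loss.

```python
import math


def nearest_code(vec, codebook):
    """Vector quantization step: index of the codebook entry closest to vec (squared L2)."""
    dists = [sum((v - c) ** 2 for v, c in zip(vec, code)) for code in codebook]
    return dists.index(min(dists))


def code_distribution(features, codebook, temperature=1.0):
    """Average soft assignment of fine-grained features (e.g. frames or words) over the codebook.

    Assumption: assignment is a softmax over negative squared distances.
    """
    totals = [0.0] * len(codebook)
    for vec in features:
        logits = [
            -sum((v - c) ** 2 for v, c in zip(vec, code)) / temperature
            for code in codebook
        ]
        peak = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(l - peak) for l in logits]
        norm = sum(exps)
        for i, e in enumerate(exps):
            totals[i] += e / norm
    return [t / len(features) for t in totals]


def code_mismatch(p, q, eps=1e-8):
    """Symmetric KL divergence between two code distributions (illustrative stand-in)."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps)) + qi * math.log((qi + eps) / (pi + eps))
        for pi, qi in zip(p, q)
    )


# Hypothetical toy data: one codebook shared by both modalities.
shared_codebook = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
video_feats = [[0.1, 0.0], [0.9, 1.1]]
paired_audio_feats = [[0.0, 0.1], [1.0, 0.9]]
unpaired_audio_feats = [[5.1, 4.9], [4.8, 5.2]]

p_video = code_distribution(video_feats, shared_codebook)
p_paired = code_distribution(paired_audio_feats, shared_codebook)
p_unpaired = code_distribution(unpaired_audio_feats, shared_codebook)

print("paired mismatch:  ", code_mismatch(p_video, p_paired))
print("unpaired mismatch:", code_mismatch(p_video, p_unpaired))
```

In this toy setup the paired video and audio features fall near the same codebook entries, so their mismatch score is lower than that of the unpaired example. A training objective in this spirit would minimize the score for paired inputs relative to unpaired ones.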

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Event Classification (A → V) | VGGSound-AVEL 90K | Precision 33.7 | 15 |
| Event Classification (V → A) | VGGSound-AVEL 40K | Precision 32.7 | 15 |
| Event Localization (A → V) | VGGSound-AVEL 40K | Segment-level Accuracy 45.1 | 11 |
| Event Localization (V → A) | VGGSound-AVEL 40K | Segment-level Accuracy 41.9 | 11 |
| Event Classification (A → V) | VGGSound-AVEL 40K | Precision 36.8 | 11 |
| Event Classification (V → A) | VGGSound-AVEL 90K | Precision 30.5 | 11 |
| Event Localization (A → V) | VGGSound-AVEL 90K | Segment-level Accuracy 43.9 | 11 |
| Event Localization (V → A) | VGGSound-AVEL 90K | Segment-level Accuracy 38.4 | 11 |