Geometric Multimodal Contrastive Representation Learning
About
Learning representations of multimodal data that are both informative and robust to missing modalities at test time remains a challenging problem due to the inherent heterogeneity of data obtained from different channels. To address it, we present a novel Geometric Multimodal Contrastive (GMC) representation learning method comprising two main components: i) a two-level architecture consisting of modality-specific base encoders, which map an arbitrary number of modalities to intermediate representations of fixed dimensionality, and a shared projection head, which maps the intermediate representations to a latent representation space; ii) a multimodal contrastive loss function that encourages the geometric alignment of the learned representations. We experimentally demonstrate that GMC representations are semantically rich and achieve state-of-the-art performance with missing modality information on three different learning problems, including prediction and reinforcement learning tasks.
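The two components described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the single-layer `LinearEncoder`, the modality names and sizes, the temperature, and the choice to average a generic InfoNCE-style loss over modality pairs are all assumptions made for illustration. The abstract states only that the loss is contrastive and encourages geometric alignment.

```python
import numpy as np

rng = np.random.default_rng(0)


def l2_normalize(x, eps=1e-8):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


class LinearEncoder:
    """Hypothetical stand-in for a learned network: one linear map plus tanh."""

    def __init__(self, in_dim, out_dim):
        self.W = rng.normal(scale=in_dim ** -0.5, size=(in_dim, out_dim))

    def __call__(self, x):
        return np.tanh(x @ self.W)


def contrastive_alignment_loss(z_a, z_b, temperature=0.1):
    """Generic InfoNCE-style loss (an assumption, not the paper's exact loss).

    Row i of z_a and row i of z_b form a positive pair; every other row of
    z_b acts as a negative for row i of z_a.
    """
    z_a, z_b = l2_normalize(z_a), l2_normalize(z_b)
    logits = z_a @ z_b.T / temperature                    # (batch, batch)
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))


# Hypothetical sizes: two modalities with different input dimensionality.
batch, intermediate_dim, latent_dim = 8, 16, 4
modality_dims = {"image": 32, "sound": 12}

# Level 1: one base encoder per modality, all emitting the same fixed size.
base_encoders = {
    name: LinearEncoder(dim, intermediate_dim)
    for name, dim in modality_dims.items()
}
# Level 2: a single projection head shared by every modality.
projection_head = LinearEncoder(intermediate_dim, latent_dim)

inputs = {name: rng.normal(size=(batch, dim)) for name, dim in modality_dims.items()}
latents = {
    name: projection_head(base_encoders[name](x)) for name, x in inputs.items()
}

# Assumed pairing strategy: average the loss over all pairs of modalities.
names = list(latents)
pair_losses = [
    contrastive_alignment_loss(latents[a], latents[b])
    for i, a in enumerate(names)
    for b in names[i + 1:]
]
loss = sum(pair_losses) / len(pair_losses)
print(f"contrastive alignment loss: {loss:.4f}")
```

Because every modality ends up in the same latent space, a representation computed from any single modality can be passed to a downstream model, which is what makes the missing-modality setting tractable in this kind of design.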
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Human Activity Recognition | RealWorld-HAR | Accuracy 93.19 | 50 |
| Physical Activity Recognition | PAMAP2 | Acc 83.12 | 50 |
| Multimodal Robotic Control | Fetch-PickAndPlace Patch corruptions (test) | Return -0.01 | 42 |
| Robot Manipulation | Fetch-Slide (test) | Return 7.67 | 28 |
| Vehicle Recognition | ACIDS | Accuracy 94.02 | 26 |
| Vehicle Recognition | MOD | Accuracy 92.57 | 26 |
| Speed Classification | MOD | Accuracy 62.5 | 24 |
| Human Activity Recognition | MOD | Accuracy 85.33 | 24 |
| Human Activity Recognition | ACIDS | Accuracy 75.89 | 24 |
| Distance Classification | MOD | Acc 84.84 | 24 |