IMU2CLIP: Multimodal Contrastive Learning for IMU Motion Sensors from Egocentric Videos and Text
About
We present IMU2CLIP, a novel pre-training approach to align Inertial Measurement Unit (IMU) motion sensor recordings with video and text, by projecting them into the joint representation space of Contrastive Language-Image Pre-training (CLIP). The proposed approach allows IMU2CLIP to translate human motions (as measured by IMU sensors) into their corresponding textual descriptions and videos -- while preserving the transitivity across these modalities. We explore several new IMU-based applications that IMU2CLIP enables, such as motion-based media retrieval and natural language reasoning tasks with motion data. In addition, we show that IMU2CLIP can significantly improve the downstream performance when fine-tuned for each application (e.g. activity recognition), demonstrating the universal usage of IMU2CLIP as a new pre-trained resource. Our code will be made publicly available.
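To make the alignment idea concrete, here is a minimal, dependency-free sketch of a symmetric contrastive (InfoNCE-style) objective between IMU embeddings and CLIP embeddings, followed by cosine-similarity retrieval. The abstract does not specify the loss, temperature, or encoder design, so this is an illustration under assumptions rather than the authors' implementation. The function names (`symmetric_info_nce`, `retrieve`), the `temperature=0.1` default, and the assumption that CLIP embeddings arrive precomputed are all hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def cosine_matrix(imu_embs, clip_embs):
    """Pairwise cosine similarities: rows index IMU windows, columns CLIP items."""
    imu = [normalize(v) for v in imu_embs]
    clip = [normalize(v) for v in clip_embs]
    return [[sum(a * b for a, b in zip(u, c)) for c in clip] for u in imu]


def _cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def symmetric_info_nce(imu_embs, clip_embs, temperature=0.1):
    """Assumed objective: average of IMU->CLIP and CLIP->IMU contrastive losses.

    Pair i in imu_embs is treated as the positive match for pair i in
    clip_embs; every other item in the batch serves as a negative.
    """
    sims = cosine_matrix(imu_embs, clip_embs)
    logits = [[s / temperature for s in row] for row in sims]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (_cross_entropy_diag(logits) + _cross_entropy_diag(transposed))


def retrieve(query_imu_emb, candidate_clip_embs, k=1):
    """Return indices of the k candidates most similar to the IMU query."""
    scores = cosine_matrix([query_imu_emb], candidate_clip_embs)[0]
    return sorted(range(len(scores)), key=lambda j: -scores[j])[:k]
```

Under this sketch, a batch whose IMU embeddings point in roughly the same direction as their paired CLIP embeddings yields a lower loss than the same batch with the pairs shuffled, and `retrieve` ranks video or text candidates for a motion query by cosine similarity in the shared space.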
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Activity Recognition | Shoaib | Accuracy 16.7 | 42 |
| Activity Recognition | mHealth | F1 Score 1.3 | 35 |
| Activity Recognition | PAMAP2 | Accuracy 1.9 | 22 |
| Activity Recognition | DSADS | -- | 20 |
| Activity Recognition | UTD-MHAD | Accuracy 3.7 | 18 |
| Activity Recognition | USC-HAD | Accuracy 12.8 | 18 |
| Activity Recognition | w-HAR | Accuracy 6.6 | 18 |
| Activity Recognition | Opportunity | Accuracy 25.9 | 18 |
| Activity Recognition | RealWorld | Accuracy 6.1 | 18 |
| Activity Recognition | TNDA-HAR | Accuracy 8.5 | 18 |