COMODO: Cross-Modal Video-to-IMU Distillation for Efficient Egocentric Human Activity Recognition

About

The goal of creating intelligent, human-centered wearable systems for continuous activity understanding faces a fundamental trade-off: Egocentric video-based models capture rich semantic information and have demonstrated strong performance in human activity recognition (HAR), but their high power consumption, privacy concerns, and dependence on lighting limit their feasibility for continuous on-device recognition. In contrast, inertial measurement unit (IMU) sensors offer an energy-efficient, privacy-preserving alternative, yet lack large-scale annotated datasets, leading to weaker generalization. To bridge this gap, we propose COMODO, a cross-modal self-supervised distillation framework that transfers semantic knowledge from video to IMU without requiring labels. COMODO leverages a pretrained and frozen video encoder to construct a dynamic instance queue to align the feature distributions of video and IMU embeddings. This enables the IMU encoder to inherit rich semantic structure from video while maintaining its efficiency for real-world applications. Experiments on multiple egocentric HAR datasets show that COMODO consistently improves downstream performance, matching or surpassing fully supervised models, and demonstrating strong cross-dataset generalization. Benefiting from its simplicity and flexibility, COMODO is compatible with diverse pretrained video and time-series models, offering the potential to leverage more powerful teacher and student foundation models in future ubiquitous computing research. The code is available at this repository: https://github.com/cruiseresearchgroup/COMODO.
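The queue-based alignment described above can be illustrated with a small sketch. This is not the authors' implementation (see the linked repository for that). It shows one common way to align a student with a frozen teacher through a shared instance queue: both embeddings are compared against the same queue of past teacher embeddings, and the student's similarity distribution is pulled toward the teacher's with a cross-entropy loss. The function names, queue size, and temperature values are assumptions chosen for illustration. Plain Python lists stand in for encoder outputs.

```python
import math
from collections import deque


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def softmax(scores, temperature):
    """Numerically stable softmax over temperature-scaled scores."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def similarity_distribution(embedding, queue, temperature):
    """Softmax over cosine similarities between one embedding and the queue."""
    z = l2_normalize(embedding)
    sims = [sum(a * b for a, b in zip(z, q)) for q in queue]
    return softmax(sims, temperature)


def distillation_loss(video_emb, imu_emb, queue,
                      tau_teacher=0.05, tau_student=0.1):
    """Cross-entropy between teacher (video) and student (IMU) distributions.

    The temperatures are illustrative assumptions, not values from the paper.
    """
    p_teacher = similarity_distribution(video_emb, queue, tau_teacher)
    p_student = similarity_distribution(imu_emb, queue, tau_student)
    return -sum(pt * math.log(ps + 1e-12)
                for pt, ps in zip(p_teacher, p_student))


def update_queue(queue, video_emb):
    """Push a teacher embedding; deque(maxlen=K) drops the oldest entry."""
    queue.append(l2_normalize(video_emb))


# Toy usage: a queue of four teacher embeddings in 2-D.
queue = deque(maxlen=4)
for v in ([1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]):
    update_queue(queue, v)

aligned = distillation_loss([1.0, 0.1], [0.9, 0.2], queue)
misaligned = distillation_loss([1.0, 0.1], [-0.9, 0.2], queue)
print(aligned < misaligned)
```

In a real training loop only the IMU encoder would receive gradients from this loss, since the video encoder is frozen. The queue would be refreshed with the video embeddings of each new batch.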

Baiyu Chen, Wilson Wongso, Zechen Li, Yonchanok Khaokaew, Hao Xue, Flora Salim • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Egocentric Human Activity Recognition | MMEA | Top-1 Accuracy | 94.83 | 23 |
| Egocentric Human Activity Recognition | EgoExo4D | Accuracy @1 | 84.92 | 19 |
| IMU-based Human Activity Recognition | Ego4D | Top-1 Accuracy | 0.5913 | 15 |
| Action Recognition | Ego4D | Top-1 Accuracy | 56.62 | 13 |
| Action Recognition | Opportunity++ 10 (test) | F1 (Weighted) | 0.43 | 5 |
| Action Recognition | HWU-USP 39 (test) | F1 (Weighted) | 50 | 5 |
| Human Activity Recognition | AnyMo Bench Fine150 Unseen Subject + Cross Device | Top-1 Accuracy | 24 | 3 |
| Human Activity Recognition | AnyMo Bench Core50 (Unseen Subject) | Top-1 Accuracy | 46.2 | 3 |
| Human Activity Recognition | AnyMo Bench Core50 Unseen Subject + Cross Device | Acc@1 | 32.6 | 3 |
| Human Activity Recognition | AnyMo Bench Fine150 (Unseen Subject) | Top-1 Accuracy | 37.8 | 3 |