COMODO: Cross-Modal Video-to-IMU Distillation for Efficient Egocentric Human Activity Recognition

About

The goal of creating intelligent, human-centered wearable systems for continuous activity understanding faces a fundamental trade-off: egocentric video-based models capture rich semantic information and have demonstrated strong performance in human activity recognition (HAR), but their high power consumption, privacy concerns, and dependence on lighting limit their feasibility for continuous on-device recognition. In contrast, inertial measurement unit (IMU) sensors offer an energy-efficient, privacy-preserving alternative, yet lack large-scale annotated datasets, leading to weaker generalization. To bridge this gap, we propose COMODO, a cross-modal self-supervised distillation framework that transfers semantic knowledge from video to IMU without requiring labels. COMODO leverages a pretrained and frozen video encoder to construct a dynamic instance queue, which is used to align the feature distributions of video and IMU embeddings. This enables the IMU encoder to inherit rich semantic structure from video while maintaining its efficiency for real-world applications. Experiments on multiple egocentric HAR datasets show that COMODO consistently improves downstream performance, matching or surpassing fully supervised models and demonstrating strong cross-dataset generalization. Benefiting from its simplicity and flexibility, COMODO is compatible with diverse pretrained video and time-series models, offering the potential to leverage more powerful teacher and student foundation models in future ubiquitous computing research. The code is available at this repository: https://github.com/cruiseresearchgroup/COMODO.

Baiyu Chen, Wilson Wongso, Zechen Li, Yonchanok Khaokaew, Hao Xue, Flora Salim • 2025
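
The queue-based alignment described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that). It assumes one common form of queue-based distillation: similarities between an embedding and a set of anchors (the current video embedding plus queued video embeddings) are turned into softmax distributions, and the IMU student is trained to match the video teacher's distribution via cross-entropy. The function names, temperatures, queue size, and toy 2-D embeddings are all hypothetical.

```python
import math
from collections import deque

# Hypothetical sketch of queue-based cross-modal distillation.
# Names, temperatures, and queue size are assumptions, not taken from COMODO.


def l2_normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores, temperature):
    """Numerically stable softmax over temperature-scaled scores."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(video_emb, imu_emb, queue, tau_teacher=0.1, tau_student=0.2):
    """Cross-entropy between teacher and student similarity distributions.

    Anchors are the current video embedding plus every queued video embedding.
    """
    teacher = l2_normalize(video_emb)
    student = l2_normalize(imu_emb)
    anchors = [teacher] + list(queue)
    p_teacher = softmax([dot(teacher, a) for a in anchors], tau_teacher)
    p_student = softmax([dot(student, a) for a in anchors], tau_student)
    return -sum(pt * math.log(ps) for pt, ps in zip(p_teacher, p_student))


def update_queue(queue, video_emb):
    """FIFO update: a deque with maxlen drops its oldest entry when full."""
    queue.append(l2_normalize(video_emb))


# Toy usage with 2-D embeddings standing in for encoder outputs.
queue = deque(maxlen=4)
for past_video_emb in ([1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]):
    update_queue(queue, past_video_emb)

video = [0.6, 0.8]
aligned = distillation_loss(video, [0.6, 0.8], queue)      # IMU matches video
misaligned = distillation_loss(video, [-0.8, 0.6], queue)  # IMU orthogonal to video
print(aligned < misaligned)
```

In this toy setup the loss is lower when the IMU embedding points in the same direction as the video embedding, which is the behavior a distribution-matching objective should encourage. In a real training loop only the student would receive gradients, since the abstract states the video encoder is frozen.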

Related benchmarks

Task	Dataset	Metric	Result
Egocentric Human Activity Recognition	MMEA	Top-1 Accuracy	94.83	23
Egocentric Human Activity Recognition	EgoExo4D	Accuracy@1	84.92	19
Action Recognition	Ego4D	--	--	17
IMU-based Human Activity Recognition	Ego4D	Top-1 Accuracy	0.5913	15
Action Recognition	Opportunity++ 10 (test)	F1 (Weighted)	0.43	5
Action Recognition	HWU-USP 39 (test)	F1 (Weighted)	50	5
Human Activity Recognition	AnyMo Bench Fine150 Unseen Subject + Cross Device	Top-1 Accuracy	24	3
Human Activity Recognition	AnyMo Bench Core50 (Unseen Subject)	Top-1 Accuracy	46.2	3
Human Activity Recognition	AnyMo Bench Core50 Unseen Subject + Cross Device	Acc@1	32.6	3
Human Activity Recognition	AnyMo Bench Fine150 (Unseen Subject)	Top-1 Accuracy	37.8	3
