
MoBind: Motion Binding for Fine-Grained IMU-Video Pose Alignment

About

We aim to learn a joint representation between inertial measurement unit (IMU) signals and 2D pose sequences extracted from video, enabling accurate cross-modal retrieval, temporal synchronization, subject and body-part localization, and action recognition. To this end, we introduce MoBind, a hierarchical contrastive learning framework designed to address three challenges: (1) filtering out irrelevant visual background, (2) modeling structured multi-sensor IMU configurations, and (3) achieving fine-grained, sub-second temporal alignment. To isolate motion-relevant cues, MoBind aligns IMU signals with skeletal motion sequences rather than raw pixels. We further decompose full-body motion into local body-part trajectories, pairing each with its corresponding IMU to enable semantically grounded multi-sensor alignment. To capture detailed temporal correspondence, MoBind employs a hierarchical contrastive strategy that first aligns token-level temporal segments, then fuses local (body-part) alignment with global (body-wide) motion aggregation. Evaluated on mRi, TotalCapture, and EgoHumans, MoBind consistently outperforms strong baselines across all four tasks, demonstrating robust fine-grained temporal alignment while preserving coarse semantic consistency across modalities. Code is available at https://github.com/bbvisual/MoBind.

Duc Duy Nguyen, Tat-Jun Chin, Minh Hoai • 2026
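
The abstract describes aligning token-level segments per body part and then combining that local alignment with a body-wide term. The sketch below illustrates one way such a loss could be composed, using a standard symmetric InfoNCE objective in plain Python. It is not the authors' implementation: the function names, the mean-pooling across body parts, the `global_weight` parameter, and the temperature value are all assumptions made for illustration.

```python
import math


def l2_normalize(vec):
    # Scale a vector to unit length (guarding against a zero norm).
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def info_nce(anchors, candidates, temperature=0.1):
    # One-directional InfoNCE: the i-th anchor should match the i-th candidate.
    a = [l2_normalize(v) for v in anchors]
    c = [l2_normalize(v) for v in candidates]
    total = 0.0
    for i, u in enumerate(a):
        logits = [sum(x * y for x, y in zip(u, v)) / temperature for v in c]
        peak = max(logits)  # subtract the max for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
        total += log_sum_exp - logits[i]
    return total / len(a)


def symmetric_info_nce(imu, pose, temperature=0.1):
    # Average the IMU-to-pose and pose-to-IMU directions.
    return 0.5 * (info_nce(imu, pose, temperature) + info_nce(pose, imu, temperature))


def mean_pool_parts(per_part):
    # Hypothetical global aggregation: average token embeddings across body parts.
    parts = list(per_part.values())
    return [[sum(dims) / len(parts) for dims in zip(*tokens)] for tokens in zip(*parts)]


def hierarchical_loss(imu_parts, pose_parts, global_weight=1.0, temperature=0.1):
    # Local term: per-body-part token alignment, averaged over parts.
    local = sum(
        symmetric_info_nce(imu_parts[p], pose_parts[p], temperature) for p in imu_parts
    ) / len(imu_parts)
    # Global term: alignment of the body-wide (pooled) token embeddings.
    global_term = symmetric_info_nce(
        mean_pool_parts(imu_parts), mean_pool_parts(pose_parts), temperature
    )
    return local + global_weight * global_term


# Toy data: two temporal tokens per body part, 2-D embeddings.
tokens = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
aligned = hierarchical_loss({"arm": tokens, "leg": tokens}, {"arm": tokens, "leg": tokens})
misaligned = hierarchical_loss({"arm": tokens, "leg": tokens}, {"arm": swapped, "leg": swapped})
print(aligned < misaligned)
```

In this toy setup, temporally matched tokens give a much lower loss than tokens whose order is swapped, which is the behavior a fine-grained alignment objective is meant to reward.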

Related benchmarks

Task                          Dataset                            Metric    Result  Rank
Human Activity Recognition    mRi                                Accuracy  98      16
Human Activity Recognition    TotalCapture                       Accuracy  72      16
Temporal Synchronization      mRi (test)                         MAE       0.47    7
Temporal Synchronization      TotalCapture (test)                MAE       0.05    7
Temporal Synchronization      EgoHumans (test)                   MAE       0.04    7
IMU-to-Person Identification  EgoHumans                          Accuracy  98.12   5
IMU-to-Video Retrieval        mRi (subject-wise split)           R@1       94      4
IMU-to-Video Retrieval        TotalCapture (subject-wise split)  R@1       87      4
IMU-to-Video Retrieval        EgoHumans (scene split)            R@1       0.83    4
Video-to-IMU Retrieval        mRi (subject-wise split)           Recall@1  92      4
Showing 10 of 12 rows
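
For readers unfamiliar with the metrics listed here, the following is a minimal sketch of how Recall@1 (R@1) and mean absolute error (MAE) are commonly computed. The exact evaluation protocol behind these benchmark numbers is not given on this page, so the function names, the similarity-matrix input, and the toy values are assumptions for illustration only.

```python
def recall_at_1(similarity):
    # similarity[i][j]: score between query i and gallery item j,
    # where item i is assumed to be the correct match for query i.
    hits = 0
    for i, row in enumerate(similarity):
        best = max(range(len(row)), key=row.__getitem__)
        if best == i:
            hits += 1
    return hits / len(similarity)


def mean_absolute_error(predicted, actual):
    # Average absolute difference, e.g. between predicted and true time offsets.
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


# Toy example: the third query retrieves the wrong item.
toy_similarity = [
    [0.9, 0.1, 0.0],
    [0.2, 0.8, 0.1],
    [0.7, 0.2, 0.1],
]
r1 = recall_at_1(toy_similarity)
mae = mean_absolute_error([0.1, -0.2], [0.0, 0.0])
```

Higher is better for Recall@1, while lower is better for MAE.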
