Multi-modal Video Representation Alignment for Robust Self-supervised Driver Distraction Detection

About

Robust self-supervised learning of multi-modal video representations is critical for real-world applications such as driver distraction detection, where multiple sensors provide complementary but noisy signals. Conventional contrastive objectives, such as InfoNCE, assume all negatives are equally informative and all positives are reliable. However, this assumption is frequently violated in multi-modal data due to viewpoint changes, occlusions, or semantic overlap across modalities. In this work, we propose a novel framework for multi-modal global alignment that addresses these challenges by jointly modeling faulty negatives and unreliable or faulty positives. We introduce soft targets derived from cycle-consistency scores to relax the hard-negative assumption, and a weighting mechanism based on similarity distributions to mitigate the impact of noisy or faulty positives. Our approach extends traditional pairwise alignment to a principled global multi-modal setting, aggregating alignment information across all modality pairs. We evaluate our method on the Drive&Act dataset, demonstrating that it consistently outperforms both pairwise and existing global alignment baselines across RGB, IR, Depth, and Skeleton modalities. Cross-view ablation studies further show strong generalization to unseen camera perspectives, highlighting the robustness of our representations. Overall, our framework provides a scalable and effective solution for self-supervised global multi-modal representation learning, enabling reliable driver distraction detection and advancing real-world multi-modal video understanding. Our code will be published on GitHub.
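The abstract names three ingredients: soft targets from cycle-consistency scores, a similarity-based weight on positives, and aggregation over all modality pairs. The sketch below is one possible reading of that description, not the authors' implementation, which this page does not include. Everything specific is an assumption made for illustration: the cycle score (product of forward and backward softmax probabilities), the mixing coefficient `alpha`, the positive weight (softmax probability of the positive), the temperature, and all function names.

```python
import math
from itertools import combinations


def cosine_sim_matrix(a, b):
    """Cosine similarity between every row of a and every row of b."""
    def norm(v):
        return math.sqrt(sum(x * x for x in v)) or 1.0
    return [[sum(x * y for x, y in zip(u, v)) / (norm(u) * norm(v)) for v in b]
            for u in a]


def log_softmax(row, temperature):
    """Numerically stable log-softmax of row / temperature."""
    scaled = [x / temperature for x in row]
    m = max(scaled)
    lse = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - lse for s in scaled]


def pair_loss(za, zb, temperature=0.1, alpha=0.2, weight_positives=True):
    """Soft-target, positive-weighted InfoNCE from modality a to modality b.

    Row i of za and row i of zb are embeddings of the same clip.
    With alpha=0 and weight_positives=False this reduces to plain InfoNCE.
    """
    sim = cosine_sim_matrix(za, zb)
    n = len(sim)
    # log_fwd[i][j] = log P(a_i -> b_j); log_bwd[j][i] = log P(b_j -> a_i)
    log_fwd = [log_softmax(row, temperature) for row in sim]
    log_bwd = [log_softmax([sim[i][j] for i in range(n)], temperature)
               for j in range(n)]
    loss_sum = weight_sum = 0.0
    for i in range(n):
        # ASSUMPTION: cycle score = probability of going i -> j -> i.
        cycle = [math.exp(log_fwd[i][j] + log_bwd[j][i]) for j in range(n)]
        z = sum(cycle) or 1.0  # guard against total underflow
        # Soft target: mostly the one-hot positive, partly the cycle scores,
        # so likely "faulty negatives" are penalised less.
        target = [(1.0 - alpha) * (j == i) + alpha * cycle[j] / z
                  for j in range(n)]
        # ASSUMPTION: a positive that scores poorly within its own
        # similarity distribution is treated as unreliable and down-weighted.
        w = math.exp(log_fwd[i][i]) if weight_positives else 1.0
        ce = -sum(t * lp for t, lp in zip(target, log_fwd[i]))
        loss_sum += w * ce
        weight_sum += w
    return loss_sum / max(weight_sum, 1e-12)


def global_alignment_loss(embeddings, **kwargs):
    """Average the symmetrised pair loss over all modality pairs.

    embeddings: dict mapping modality name -> list of N vectors (same clip order).
    """
    pairs = list(combinations(sorted(embeddings), 2))
    total = 0.0
    for a, b in pairs:
        total += 0.5 * (pair_loss(embeddings[a], embeddings[b], **kwargs)
                        + pair_loss(embeddings[b], embeddings[a], **kwargs))
    return total / len(pairs)


# Toy 2-D embeddings for two clips in three modalities (made-up values).
aligned = {
    "rgb": [[1.0, 0.0], [0.0, 1.0]],
    "ir": [[0.9, 0.1], [0.1, 0.9]],
    "depth": [[1.0, 0.2], [0.2, 1.0]],
}
# Same data with the IR clips swapped, i.e. every IR positive is wrong.
shuffled = dict(aligned, ir=aligned["ir"][::-1])

loss_aligned = global_alignment_loss(aligned)
loss_shuffled = global_alignment_loss(shuffled)
print(loss_aligned < loss_shuffled)
```

In a training setup the soft targets and weights would normally be treated as constants when differentiating the loss; that detail is also an assumption here, since the sketch has no autograd.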

David J. Lerch, Livien Majer, Zeyun Zhong, Manuel Martin, Frederik Diederichs, Rainer Stiefelhagen • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Driver distraction detection | Drive&Act IR | Average Balanced Accuracy | 58.92 | 9 |
| Driver distraction detection | Drive&Act Depth | Average Balanced Accuracy | 54.25 | 9 |
| Driver distraction detection | Drive&Act Skeleton | Avg Balanced Accuracy | 40.72 | 9 |
| Driver distraction detection | Drive&Act Inner Mirror IR view | Balanced Accuracy | 41.38 | 6 |
| Driver distraction detection | Drive&Act Ceiling IR view | Average Balanced Accuracy | 35.57 | 6 |
| Driver distraction detection | Drive&Act Average of unseen IR views | Average Balanced Accuracy | 37.26 | 6 |
| Driver distraction detection | Drive&Act Kinect IR view | Average Balanced Accuracy | 57.3 | 6 |
| Driver distraction detection | Drive&Act Wheel IR view | Average Balanced Accuracy | 22.84 | 6 |
| Driver distraction detection | Drive&Act Driver IR view | Average Balanced Accuracy | 40.09 | 6 |
| Driver distraction detection | Drive&Act Co Driver IR view | Average Balanced Accuracy | 46.4 | 6 |
Showing 10 of 11 rows
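
The results above are reported as balanced accuracy, which is the mean of per-class recall, so a rare activity counts as much as a frequent one. The short sketch below shows that computation on made-up labels and then checks the "Average of unseen IR views" figure. The class names are hypothetical, and treating the Inner Mirror, Ceiling, Wheel, Driver and Co Driver rows as the unseen views (with Kinect as the training view) is an assumption; the page does not say so, although the listed numbers are consistent with it.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    hits, totals = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


# Hypothetical labels: the single "phone" clip is misclassified as "drive".
y_true = ["drive", "drive", "drive", "drive", "phone", "drink"]
y_pred = ["drive", "drive", "drive", "drive", "drive", "drink"]
# Plain accuracy is 5/6, but per-class recall is 1, 0 and 1.
print(round(balanced_accuracy(y_true, y_pred), 3))  # 0.667

# Per-view results copied from the rows above.
# ASSUMPTION: these five are the views counted as "unseen".
unseen_ir_views = {
    "Inner Mirror": 41.38,
    "Ceiling": 35.57,
    "Wheel": 22.84,
    "Driver": 40.09,
    "Co Driver": 46.4,
}
mean_unseen = sum(unseen_ir_views.values()) / len(unseen_ir_views)
print(round(mean_unseen, 2))  # 37.26
```

The computed mean matches the 37.26 listed for the average of unseen IR views, which supports the grouping but does not confirm it.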
