
Balancing Multi-modal Sensor Learning via Multi-objective Optimization

About

Learning-enabled control systems increasingly rely on multiple sensing modalities (e.g., vision, audio, and language) for perception and decision support. A key challenge is that multi-modal sensor training dynamics are often imbalanced: fast-to-learn sensing channels dominate optimization while slower channels remain underutilized, degrading reliability under sensing perturbations. Existing balancing strategies are largely heuristic and can require computationally intensive subroutines. In this paper, we reformulate multi-modal sensor learning as a multi-objective optimization (MOO) problem that explicitly prioritizes the worst-performing modality while retaining the nominal multi-modal sensor fusion objective. We then propose a simple gradient-based method, MIMO (multi-modal sensor learning via MOO), for the resulting formulation. We provide convergence guarantees and evaluate the method on standard multi-modal benchmarks. Results show improved balanced performance over state-of-the-art balanced multi-modal learning and MOO baselines, together with up to ~20x reduction in subroutine computation time, highlighting the suitability of MIMO for resource-constrained control pipelines.
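The abstract's core idea, prioritizing the worst-performing modality while keeping the nominal fusion objective, can be illustrated with a toy sketch. The code below is a hypothetical numpy example, not the paper's actual MIMO algorithm: the two quadratic "modality" losses, the mixing weight `lam`, and the function names are all illustrative assumptions. It mixes the gradient of the currently worst modality with the fusion gradient, so the fast modality cannot dominate the slow one.

```python
import numpy as np

# Toy setup: two "modality" losses sharing one parameter vector theta.
# Modality A has large curvature (learns fast); modality B has small
# curvature (learns slowly), mimicking imbalanced training dynamics.
def loss_a(theta):
    return 0.5 * np.sum((theta - 1.0) ** 2)

def loss_b(theta):
    return 0.05 * np.sum((theta + 1.0) ** 2)

def grad_a(theta):
    return theta - 1.0

def grad_b(theta):
    return 0.1 * (theta + 1.0)

def fusion_grad(theta):
    # Nominal fusion objective: sum of the two modality losses.
    return grad_a(theta) + grad_b(theta)

def mimo_like_step(theta, lr=0.1, lam=0.5):
    """One gradient step mixing the fusion gradient with the gradient
    of whichever modality currently has the highest loss."""
    losses = [loss_a(theta), loss_b(theta)]
    grads = [grad_a(theta), grad_b(theta)]
    worst = int(np.argmax(losses))
    g = lam * grads[worst] + (1.0 - lam) * fusion_grad(theta)
    return theta - lr * g

theta = np.zeros(3)
for _ in range(200):
    theta = mimo_like_step(theta)
print("final losses:", loss_a(theta), loss_b(theta))
```

With plain fusion-only descent, the fast modality A drags the solution toward its own optimum; mixing in the worst-modality gradient keeps the two losses closer together. The real method additionally comes with convergence guarantees, which this sketch makes no claim to.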

Heshan Fernando, Quan Xiao, Parikshit Ram, Yi Zhou, Horst Samulowitz, Nathalie Baracaldo, Tianyi Chen · 2025

Related benchmarks

Task | Dataset | Result | Rank
Audio-Visual Event Localization | AVE (test) | Accuracy: 73.69 | 54
Multimodal Classification | Kinetics-Sounds (test) | Multimodal Accuracy: 69.6 | 30
Multimodal Classification | CREMA-D | Accuracy: 75.96 | 28
Audio-Visual Event Classification | VGGSound (test) | Fusion Top-1 Acc: 69.1 | 23
Multimodal Classification | UR-FUNNY | Accuracy: 64.54 | 21
Multi-modal Classification | AV-MNIST (val) | Accuracy (Audio): 42.21 | 10
Sentiment analysis and emotion recognition | CMU-MOSEI (test) | Inference Time (s): 0.287 | 5
