
Multimodal Learning on Low-Quality Data with Conformal Predictive Self-Calibration

About

Multimodal learning often grapples with low-quality data, which manifests mainly in two forms: modality imbalance and noisy corruption. While these issues are often studied in isolation, we argue that they share a common root in predictive uncertainty about the reliability of individual modalities and instances during learning. In this paper, we propose a unified framework, termed Conformal Predictive Self-Calibration (CPSC), which leverages conformal prediction to equip the model with the ability to perform self-guided calibration on the fly. The core of CPSC is a novel self-calibrating training loop that integrates two key modules: (1) Representation Self-Calibration, which decomposes unimodal features into components and selectively fuses the most robust ones, as identified by a conformal predictor, to enhance feature resilience; and (2) Gradient Self-Calibration, which recalibrates the gradient flow during backpropagation based on instance-wise reliability scores, steering the optimization towards more trustworthy directions. We also devise a self-update strategy for the conformal predictor so that the entire system co-evolves consistently throughout training. Extensive experiments on six benchmark datasets under both imbalanced and noisy settings demonstrate that CPSC consistently outperforms existing state-of-the-art methods. Our code is available at https://github.com/XunCHN/CPSC.

Xun Jiang, Yufan Gu, Disen Hu, Yuqing Hou, Yazhou Yao, Fumin Shen, Heng Tao Shen, Xing Xu • 2026
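
The abstract relies on conformal prediction to produce reliability scores but does not spell out the procedure. As a rough illustration only, the sketch below shows textbook split conformal prediction for classification, which may differ from what CPSC actually does. The function names, the `1 - p[true class]` nonconformity score, and the inverse-set-size `reliability` proxy are all assumptions made for this example and are not taken from the paper or its repository.

```python
import math

def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Split conformal calibration: return the score threshold q_hat."""
    # Nonconformity score: 1 - probability assigned to the true class.
    scores = sorted(1.0 - p[y] for p, y in zip(cal_probs, cal_labels))
    n = len(scores)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points: use the trivial threshold,
        # under which every class is included in the set.
        return 1.0
    return scores[rank - 1]

def prediction_set(probs, q_hat):
    """Classes whose nonconformity score does not exceed q_hat."""
    return [k for k, p in enumerate(probs) if 1.0 - p <= q_hat]

def reliability(probs, q_hat):
    """Hypothetical reliability proxy: smaller prediction sets score higher."""
    size = len(prediction_set(probs, q_hat))
    return 1.0 / size if size else 0.0

if __name__ == "__main__":
    # Toy calibration data (softmax outputs and true labels), made up here.
    cal_probs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]]
    cal_labels = [0, 1, 0, 1]
    q_hat = conformal_threshold(cal_probs, cal_labels, alpha=0.5)
    print(prediction_set([0.95, 0.05], q_hat), reliability([0.95, 0.05], q_hat))
```

In a training loop, a per-instance score of this kind could in principle be used to weight losses or gradients, which is the general idea the abstract attributes to Gradient Self-Calibration; the exact weighting used by the authors is not stated here.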

Related benchmarks

Task                           Dataset          Metric                 Result  Rank
Multimodal Classification      NYU Depth V2     Accuracy (Clean)       73.12   17
Audio-Visual Classification    AVE              Average Score          0.6341  14
Multimodal Sentiment Analysis  MVSA Single      Accuracy (Clean)       80.07   13
Audio-Visual Classification    Kinetics-Sounds  Accuracy (Mixed)       76.08   8
Audio-Visual Classification    CREMA-D          Accuracy (Multimodal)  87.83   8
Multimodal Classification      SUN RGB-D        Accuracy (Clean)       62.12   6
