Smoothing the Shift: Towards Stable Test-Time Adaptation under Complex Multimodal Noises

About

Test-Time Adaptation (TTA) aims to tackle distribution shifts using unlabeled test data without access to the source data. Multimodal data exhibit more complex noise patterns than unimodal data, such as simultaneous corruptions of multiple modalities and missing modalities. Moreover, in real-world applications, corruptions from different distribution shifts are always mixed. Existing TTA methods always fail in such multimodal scenarios because abrupt distribution shifts destroy the prior knowledge of the source model, leading to performance degradation. To this end, we reveal a new challenge named multimodal wild TTA. To address this challenging problem, we propose two novel strategies: sample identification with interquartile range Smoothing and unimodal assistance, and Mutual information sharing (SuMi). SuMi smooths the adaptation process using the interquartile range, which avoids abrupt distribution shifts. SuMi then uses the unimodal features to select low-entropy samples with rich multimodal information for optimization. Furthermore, mutual information sharing is introduced to align the information, reduce discrepancies, and enhance information utilization across modalities. Extensive experiments on two public datasets show the effectiveness and superiority of SuMi over existing methods under the complex noise patterns found in multimodal data. Code is available at https://github.com/zrguo/SuMi.
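
The abstract does not spell out the exact selection rule, so the following is only a minimal sketch of what interquartile-range filtering combined with unimodal-assisted, low-entropy sample selection could look like. Every name here (prediction_entropy, iqr_keep_mask, select_samples), the multiplier k = 1.5, and the rule "fused entropy below the mean unimodal entropy" are assumptions made for illustration; they are not taken from the SuMi code at the repository above.

```python
import numpy as np

def prediction_entropy(logits):
    """Shannon entropy of the softmax distribution for each row of logits."""
    z = logits - logits.max(axis=1, keepdims=True)  # stabilise the exponent
    p = np.exp(z)
    p /= p.sum(axis=1, keepdims=True)
    return -(p * np.log(p + 1e-12)).sum(axis=1)

def iqr_keep_mask(values, k=1.5):
    """True where a value lies inside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values >= q1 - k * iqr) & (values <= q3 + k * iqr)

def select_samples(fused_logits, unimodal_logits, k=1.5):
    """Hypothetical selection rule (an assumption, not the paper's):
    1. drop samples whose fused entropy is an IQR outlier in the batch;
    2. keep samples whose fused prediction is more confident (lower
       entropy) than the average of the unimodal predictions.
    """
    h_fused = prediction_entropy(fused_logits)
    h_uni = np.mean([prediction_entropy(u) for u in unimodal_logits], axis=0)
    return iqr_keep_mask(h_fused, k) & (h_fused < h_uni)

# Toy batch: five confident fused predictions and one uniform outlier.
scales = np.array([5.0, 5.1, 4.9, 5.05, 4.95, 0.0])
fused = np.zeros((6, 3))
fused[:, 0] = scales
audio = np.zeros((6, 3))  # uninformative unimodal logits
video = np.zeros((6, 3))
mask = select_samples(fused, [audio, video])
```

In this toy batch the sixth sample has a much higher fused entropy than the rest, so the interquartile-range filter excludes it from the adaptation step, while the five confident samples are kept.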

Zirun Guo, Tao Jin • 2025

Related benchmarks

Task | Dataset | Metric | Result | Rank
Multimodal Sentiment Analysis | MOSI | Accuracy | 59.4 | 54
Fake News Video Detection | FakeSV (Source: FakeTT) (test) | Accuracy | 60.96 | 33
Video Classification | VGGSound-C unimodal (test) | Accuracy (Gaussian) | 53.14 | 25
Classification | VGGSound-C (test) | Error Rate (Gauss.) | 37.66 | 24
Fake News Video Detection | FakeTT → FVC | Acc | 60.6 | 23
Multimodal Event Classification | VGGSound-C severity level 5 (test) | Gauss. Corruption Accuracy | 54 | 20
Multimodal Event Classification | Kinetics50-C severity level 5 (test) | Accuracy (Gaussian Noise) | 50.1 | 20
Video Classification | Kinetics 50-C | Gaussian Noise Robustness | 75.1 | 18
Fake News Video Detection | FakeTT → FakeSV | Accuracy | 60.02 | 18
Task-wise classification accuracy | Kinetics50-2C bimodal (test) | Gaussian Robustness Acc | 35.34 | 14
Showing 10 of 25 rows
