Deep Correlated Prompting for Visual Recognition with Missing Modalities

About

Large-scale multimodal models have shown excellent performance across a range of tasks, powered by large corpora of paired multimodal training data. They are generally assumed to receive modality-complete inputs. However, this assumption may not hold in the real world because of privacy constraints or collection difficulty, and models pretrained on modality-complete data readily show degraded performance in missing-modality cases. To address this issue, we turn to prompt learning to adapt large pretrained multimodal models to missing-modality scenarios, treating different missing cases as different types of input. Instead of only prepending independent prompts to the intermediate layers, we propose to leverage the correlations between prompts and input features and to exploit the relationships between prompts at different layers when designing the instructions. We also incorporate the complementary semantics of different modalities to guide the prompt design for each modality. Extensive experiments on three commonly used datasets consistently demonstrate the superiority of our method over previous approaches under different missing scenarios. Ablation studies further show the generalizability and reliability of our method across different modality-missing ratios and types.
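The two ideas in the abstract (treating each missing case as its own input type, and relating prompts across layers instead of drawing them independently) can be illustrated with a toy sketch. This is not the authors' implementation: the case labels, function names, and the plain linear map between layers are all assumptions made for illustration, and a real model would learn these weights and also condition on input features.

```python
import random

# Hypothetical case labels for an image + text model (not from the paper).
CASES = ("complete", "missing_image", "missing_text")


def assign_missing_cases(n_samples, missing_rate, missing_type, seed=0):
    """Label each sample with an input case, given a missing rate.

    missing_type is "image", "text", or "both" (either modality may be absent).
    """
    rng = random.Random(seed)
    n_missing = round(n_samples * missing_rate)
    missing_idx = set(rng.sample(range(n_samples), n_missing))
    cases = []
    for i in range(n_samples):
        if i not in missing_idx:
            cases.append("complete")
        elif missing_type == "both":
            cases.append(rng.choice(("missing_image", "missing_text")))
        else:
            cases.append("missing_" + missing_type)
    return cases


def init_prompts(prompt_len, dim, seed=0):
    """Create one first-layer prompt (prompt_len x dim) per input case."""
    rng = random.Random(seed)
    return {
        case: [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(prompt_len)]
        for case in CASES
    }


def linear(vectors, weight):
    """Apply a dim x dim weight matrix to each vector in a list."""
    return [[sum(w * x for w, x in zip(row, v)) for row in weight] for v in vectors]


def layered_prompts(first_prompt, weights):
    """Derive each deeper prompt from the previous layer's prompt.

    Assumption: a linear map stands in for the cross-layer relationship.
    Independent prompting would instead draw a fresh prompt at every layer.
    """
    prompts = [first_prompt]
    for weight in weights:
        prompts.append(linear(prompts[-1], weight))
    return prompts


def prepend(prompt, tokens):
    """Prepend prompt vectors to a layer's input token sequence."""
    return prompt + tokens


# Usage: 10 samples, 70% of them missing the text modality.
cases = assign_missing_cases(n_samples=10, missing_rate=0.7, missing_type="text")
bank = init_prompts(prompt_len=2, dim=4)
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
per_layer = layered_prompts(bank[cases[0]], weights=[identity, identity])
layer_input = prepend(per_layer[0], tokens=[[0.0] * 4 for _ in range(3)])
```

The identity weights in the usage lines are placeholders that make every layer reuse the first prompt unchanged; swapping in learned matrices is what would let deeper prompts differ while still depending on earlier ones.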

Lianyu Hu, Tongkai Shi, Wei Feng, Fanhua Shang, Liang Wan • 2024

Related benchmarks

Task	Dataset	Metric	Result
Multimodal Multilabel Classification	MM-IMDB (test)	Macro F1	53.14	104
Image Classification	Food101 (test)	Accuracy	87.74	97
Hateful Meme Detection	Hateful Memes (test)	AUROC	0.6387	67
Multimodal Classification	UPMC Food-101 (test)	Accuracy	89.12	28
Hateful Meme Detection	Hateful Memes (val)	AUROC	60.56	22
Multimodal Classification	N24News (test)	Accuracy	67.82	21
Multi-modal hate speech detection	MMHS11K (test)	Accuracy	73.64	21
Multi-label Multimodal Classification	MM-IMDb 70% missing rate (test)	Text F1 (Macro)	49.99	7
Multi-label Multimodal Classification	MM-IMDb 90% missing rate (test)	Text F1-M	48.4	7
Multimodal Food Classification	Food101 90% missing rate (test)	Text Accuracy	75.26	7

Showing 10 of 15 rows
