AOEPT: Breaking the Implicit Modality-Reduction Bottleneck in Modality-Missing Prompt Tuning
About
Deploying multimodal systems in real-world environments often entails handling modality-missing scenarios, where one or more modalities are unavailable. While recent studies address this challenge for the general Multimodal Transformer (MT) architecture via prompt tuning, we identify a fundamental limitation in these methods: the Implicit Modality-Reduction bottleneck. By conditioning prompts solely on the observed modalities, they inadvertently restrict the reasoning scope of MTs to the modality-reduced subspace, cutting off access to the latent information sources of the missing modalities. To overcome this limitation, we propose AOEPT, which introduces a modal-contextualized approach to prompting. Specifically, we introduce lightweight Modal-Contextualized Prompts (MCPs) that distill global modality-wise priors from the training data and serve as latent repositories of information about the missing modalities. Conditioned on the remaining modalities, these MCPs are instantiated into instance-aware prompts that selectively supply missing-modality information for each sample, thereby restoring the reasoning scope of MTs beyond the observed-modality-only subspace. Experiments across various multimodal benchmarks and backbones confirm the strong performance of AOEPT at minimal computational overhead.
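The prompting idea described above can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not say how MCPs are conditioned on the remaining modalities, so single-head cross-attention with a residual connection is assumed here, and every name (`mcp_banks`, `instantiate_prompts`, `build_input`) is hypothetical. In a real system the prompt banks would be learned parameters; random values stand in for them.

```python
import math
import random


def softmax(scores):
    # Numerically stable softmax over a list of floats.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def instantiate_prompts(mcp_bank, observed_tokens):
    """Turn global prompts into instance-aware prompts.

    ASSUMPTION: conditioning is modeled as cross-attention, with each global
    prompt as the query and the observed tokens as keys and values.
    """
    dim = len(mcp_bank[0])
    scale = 1.0 / math.sqrt(dim)
    prompts = []
    for prompt in mcp_bank:
        weights = softmax([dot(prompt, tok) * scale for tok in observed_tokens])
        context = [
            sum(w * tok[i] for w, tok in zip(weights, observed_tokens))
            for i in range(dim)
        ]
        # Residual: keep the global prior and add the per-sample context.
        prompts.append([p + c for p, c in zip(prompt, context)])
    return prompts


def build_input(sample, mcp_banks):
    """Prepend instance-aware prompts for every missing modality.

    `sample` maps a modality name to its token embeddings, or to None if
    that modality is missing for this sample.
    """
    observed = [tok for toks in sample.values() if toks is not None for tok in toks]
    if not observed:
        raise ValueError("at least one modality must be observed")
    prompts = []
    for modality, toks in sample.items():
        if toks is None:
            prompts.extend(instantiate_prompts(mcp_banks[modality], observed))
    return prompts + observed


random.seed(0)
DIM = 4
# Hypothetical prompt banks: two prompt vectors per modality.
mcp_banks = {
    name: [[random.gauss(0.0, 0.02) for _ in range(DIM)] for _ in range(2)]
    for name in ("image", "text")
}
# A sample whose image is missing and whose text has three tokens.
sample = {
    "image": None,
    "text": [[random.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(3)],
}
sequence = build_input(sample, mcp_banks)
print(len(sequence))  # 2 image prompts + 3 text tokens = 5
```

The point of the sketch is the data flow: the prompt bank for the missing modality is shared across all samples, while the prompts actually fed to the transformer differ per sample because they depend on the observed tokens. When no modality is missing, the input sequence is left unchanged.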
Related benchmarks
| Task | Dataset | Metric | Score | Rank |
|---|---|---|---|---|
| Multimodal Multilabel Classification | MM-IMDB (test) | Macro F1 | 39.86 | 94 |
| Multi-label Multimodal Classification | MM-IMDb 70% missing rate (test) | Text F1 (Macro) | 51.5 | 7 |
| Multi-label Multimodal Classification | MM-IMDb 90% missing rate (test) | Text F1-M | 50.54 | 7 |
| Multimodal Food Classification | Food101 70% missing rate (test) | Text Accuracy | 80.77 | 7 |
| Multimodal Food Classification | Food101 90% missing rate (test) | Text Accuracy | 77.47 | 7 |
| Multimodal Hateful Meme Detection | HateMemes 70% missing rate (test) | Text AUC | 71.12 | 7 |
| Multimodal Hateful Meme Detection | HateMemes 90% missing rate (test) | Text AUC | 70.53 | 7 |