Retrieval-Augmented Dynamic Prompt Tuning for Incomplete Multimodal Learning

About

Multimodal learning with incomplete modalities is both practical and challenging. Recently, researchers have focused on enhancing the robustness of pre-trained MultiModal Transformers (MMTs) under missing-modality conditions by applying learnable prompts. However, these prompt-based methods face several limitations: (1) incomplete modalities provide restricted modal cues for task-specific inference, (2) dummy imputation for missing content causes information loss and introduces noise, and (3) static prompts are instance-agnostic, offering limited knowledge for instances with various missing conditions. To address these issues, we propose RAGPT, a novel Retrieval-AuGmented dynamic Prompt Tuning framework. RAGPT comprises three modules: (I) the multi-channel retriever, which identifies similar instances through a within-modality retrieval strategy, (II) the missing modality generator, which recovers missing information using retrieved contexts, and (III) the context-aware prompter, which captures contextual knowledge from relevant instances and generates dynamic prompts to substantially enhance the MMT's robustness. Extensive experiments conducted on three real-world datasets show that RAGPT consistently outperforms all competitive baselines in handling incomplete-modality problems. The code of our work and prompt-based baselines is available at https://github.com/Jian-Lang/RAGPT.
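
To make the first two modules more concrete, here is a minimal sketch of within-modality retrieval followed by a naive recovery of the missing modality. It is not the authors' implementation: the memory-bank layout, the function names (`retrieve_within_modality`, `recover_missing`), the cosine-similarity ranking, and the mean-pooling recovery step are all assumptions chosen for illustration. The abstract only states that similar instances are retrieved within a modality and that retrieved contexts are used to recover missing information.

```python
import math
from typing import Dict, List, Optional, Sequence

Vector = Sequence[float]
Instance = Dict[str, Optional[object]]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_within_modality(query: Instance, bank: List[Instance],
                             modality: str, k: int = 2) -> List[Instance]:
    """Rank bank entries by similarity to the query in ONE shared modality.

    Hypothetical helper: only the modality the query actually has is compared,
    so a text-only query is matched against the text embeddings in the bank.
    """
    candidates = [e for e in bank if e.get(modality) is not None]
    candidates.sort(key=lambda e: cosine(query[modality], e[modality]),
                    reverse=True)
    return candidates[:k]


def recover_missing(retrieved: List[Instance], modality: str) -> Optional[List[float]]:
    """Assumed recovery step: mean-pool the missing modality over the neighbours."""
    vectors = [e[modality] for e in retrieved if e.get(modality) is not None]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


# Toy memory bank with 2-d embeddings per modality (made-up values).
bank = [
    {"id": "a", "text": [1.0, 0.0], "image": [0.0, 2.0]},
    {"id": "b", "text": [0.9, 0.1], "image": [0.0, 4.0]},
    {"id": "c", "text": [0.0, 1.0], "image": [6.0, 0.0]},
]

# A query whose image is missing: retrieve by text, then fill in the image.
query = {"text": [1.0, 0.05], "image": None}
neighbours = retrieve_within_modality(query, bank, "text", k=2)
recovered_image = recover_missing(neighbours, "image")
```

In this toy setup the text-only query retrieves instances `a` and `b`, and the recovered image embedding is the average of their image embeddings. A learned generator and a prompter conditioned on the retrieved instances would take the place of the mean-pooling step in a real system.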

Jian Lang, Zhangtao Cheng, Ting Zhong, Fan Zhou • 2025

Related benchmarks

Task	Dataset	Result
Image Classification	Food101 (test)	Accuracy 82.42	97
Multimodal Multilabel Classification	MM-IMDB (test)	Macro F1 54.33	94
Multi-modal hate speech detection	MMHS11K (test)	Accuracy 76.93	21
Multimodal Classification	N24News (test)	Accuracy 61.18	21
Multimodal Food Classification	Food101 90% missing rate (test)	Text Accuracy 76.62	7
Multimodal Hateful Meme Detection	HateMemes 70% missing rate (test)	Text AUC 67.38	7
Multimodal Food Classification	Food101 70% missing rate (test)	Text Accuracy 79.55	7
Multimodal Hateful Meme Detection	HateMemes 90% missing rate (test)	Text AUC 68	7
Multi-label Multimodal Classification	MM-IMDb 90% missing rate (test)	Text F1-M 48.4	7
Multi-label Multimodal Classification	MM-IMDb 70% missing rate (test)	Text F1 (Macro) 49.02	7
