Bridging the Missing-Modality Gap: Improving Text-Only Calibration of Vision Language Models

About

Vision-language models (VLMs) are often deployed on text-only inputs even though they are trained with images. We find that removing the vision modality causes large drops in accuracy and severe miscalibration, and that the model does not behave like its original language backbone under text-only prompting. Missing semantic information alone does not explain this failure: even when text descriptions preserve the key content, confidence becomes unreliable, whereas adding a visual signal through generated images partially restores accuracy and calibration. We propose the Latent Imagination Module (LIM), a lightweight cross-attention module that predicts imagined latent embeddings from textual input and feeds them into a frozen VLM backbone without pixel-level image synthesis. Across text-only benchmarks, unseen tasks, and missing-image scenarios, LIM improves accuracy and reduces calibration error. These results suggest that latent modality completion is a practical approach to reliable VLM inference when a modality is missing.

Mingyeong Kim, Jungwon Choi, Chaeyun Jang, Juho Lee • 2026
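
The abstract reports reductions in calibration error but does not say which metric is used. As a minimal sketch, assuming the common expected calibration error (ECE) with equal-width confidence bins, the quantity could be computed as below. The function name, the bin count, and the toy inputs are illustrative assumptions, not taken from the paper.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE (assumed metric; the abstract does not name one).

    confidences: predicted probability of the chosen answer, each in [0, 1].
    correct:     booleans, True where the chosen answer was right.
    Returns the sample-weighted mean of |accuracy - mean confidence| per bin.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("at least one prediction is required")

    # Group predictions by confidence bin; a confidence of 1.0 goes in the last bin.
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, h in members if h) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece


# Hypothetical toy data: an overconfident model (90% confidence, 75% accuracy).
toy_ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False])
```

A lower value means the reported confidence tracks empirical accuracy more closely; a model whose confidence always equals its accuracy within each bin scores zero.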

Related benchmarks

Task | Dataset | Metric | Result | Rank
Logical reasoning | LogiQA | Accuracy | 40 | 34
Fact Verification | VitaminC | Accuracy | 60 | 8
Commonsense Question Answering | CommonsenseQA | Accuracy | 75 | 2
Linguistic Acceptability | COLA | Accuracy | 68 | 2
Multiple-choice Question Answering | MMLU+ARC in-domain | Accuracy | 62.23 | 2
Paraphrase Detection | MRPC | Accuracy | 66 | 2
Question Answering | QASC | Accuracy | 43 | 2
Topic Classification | AG-News | Accuracy | 79.5 | 2