
X-InstructBLIP: A Framework for aligning X-Modal instruction-aware representations to LLMs and Emergent Cross-modal Reasoning

About

Recent research has achieved significant advancements in visual reasoning tasks through learning image-to-language projections and leveraging the impressive reasoning abilities of Large Language Models (LLMs). This paper introduces an efficient and effective framework that integrates multiple modalities (images, 3D, audio and video) to a frozen LLM and demonstrates an emergent ability for cross-modal reasoning (2+ modality inputs). Our approach explores two distinct projection mechanisms: Q-Formers and Linear Projections (LPs). Through extensive experimentation across all four modalities on 16 benchmarks, we explore both methods and assess their adaptability in integrated and separate cross-modal reasoning. The Q-Former projection demonstrates superior performance in single-modality scenarios and adaptability in joint versus discriminative reasoning involving two or more modalities. However, it exhibits lower generalization capabilities than linear projection in contexts where task-modality data are limited. To enable this framework, we devise a scalable pipeline that automatically generates high-quality instruction-tuning datasets from readily available captioning data across different modalities, and contribute 24K QA data for audio and 250K QA data for 3D. To facilitate further research in cross-modal reasoning, we introduce the DisCRn (Discriminative Cross-modal Reasoning) benchmark comprising 9K audio-video QA samples and 28K image-3D QA samples that require the model to reason discriminatively across disparate input modalities.

Artemis Panagopoulou, Le Xue, Ning Yu, Junnan Li, Dongxu Li, Shafiq Joty, Ran Xu, Silvio Savarese, Caiming Xiong, Juan Carlos Niebles • 2023

Related benchmarks

Task                        Dataset            Result            Rank
Visual Question Answering   GQA                Accuracy 48.1     1249
Multimodal Understanding    MMBench            --                637
Video Question Answering    MSRVTT-QA          Accuracy 41.3     491
Audio Classification        ESC-50             Accuracy 38.2     374
Video Question Answering    MSVD-QA (test)     Accuracy 51.7     279
Visual Question Answering   OK-VQA             Accuracy 30.61    260
Visual Question Answering   A-OKVQA            Accuracy 21.52    202
Video Question Answering    MSVD               Accuracy 52.5     152
Audio Captioning            AudioCaps (test)   CIDEr 58.3        140
Video Captioning            MSRVTT             CIDEr 58.8        68
Showing 10 of 52 rows
