Recognizing Everything from All Modalities at Once: Grounded Multimodal Universal Information Extraction

About

In the field of information extraction (IE), tasks across a wide range of modalities and their combinations have traditionally been studied in isolation, leaving a gap in deeply recognizing and analyzing cross-modal information. To address this, this work introduces, for the first time, the concept of grounded Multimodal Universal Information Extraction (MUIE), providing a unified task framework for analyzing any IE task over various modalities, along with their fine-grained groundings. To tackle MUIE, we tailor a multimodal large language model (MLLM), Reamo, capable of extracting and grounding information from all modalities, i.e., recognizing everything from all modalities at once. Reamo is updated via varied tuning strategies, equipping it with powerful capabilities for information recognition and fine-grained multimodal grounding. To address the absence of a suitable benchmark for grounded MUIE, we curate a high-quality, diverse, and challenging test set, which encompasses IE tasks across 9 common modality combinations with the corresponding multimodal groundings. The extensive comparison of Reamo with existing MLLMs integrated into pipeline approaches demonstrates its advantages across all evaluation dimensions, establishing a strong benchmark for follow-up research. Our resources are publicly released at https://haofei.vip/MUIE.

Meishan Zhang, Hao Fei, Bin Wang, Shengqiong Wu, Yixin Cao, Fei Li, Min Zhang • 2024

Related benchmarks

Task                                 Dataset           Result                        Rank
Named Entity Recognition             WNUT 2017 (test)  F1 Score: 47.4                63
Multi-modal Relation Extraction      MNRE (test)       F1 Score: 24.6                59
Multimodal Named Entity Recognition  TWITTER 2017      F1 Score: 83.73               22
Multimodal Named Entity Recognition  TWITTER 2015      F1 Score: 72.68               21
Image Segmentation                   PASCAL-C (test)   I-Seg: 64.6                   12
Multimodal Information Extraction    M3D English       Entity Recognition F1: 74.84  7
Multimodal Information Extraction    M3D Chinese (ZH)  Entity Recognition F1: 78.62  7
Multimodal Relation Extraction       MNRE              F1 Score: 67.14               7
Event Argument Extraction            M2E2 (test)       EA Score (%): 25.6            4
Event Argument Extraction            imSitu (test)     EA Score: 16.3                4
