
Cross-media Structured Common Space for Multimedia Event Extraction

About

We introduce a new task, MultiMedia Event Extraction (M2E2), which aims to extract events and their arguments from multimedia documents. We develop the first benchmark and collect a dataset of 245 multimedia news articles with extensively annotated events and arguments. We propose a novel method, Weakly Aligned Structured Embedding (WASE), that encodes structured representations of semantic information from textual and visual data into a common embedding space. The structures are aligned across modalities by employing a weakly supervised training strategy, which enables exploiting available resources without explicit cross-media annotation. Compared to uni-modal state-of-the-art methods, our approach achieves 4.0% and 9.8% absolute F-score gains on text event argument role labeling and visual event extraction. Compared to state-of-the-art multimedia unstructured representations, we achieve 8.3% and 5.0% absolute F-score gains on multimedia event extraction and argument role labeling, respectively. By utilizing images, we extract 21.4% more event mentions than traditional text-only methods.
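The core idea above is aligning textual and visual embeddings in a common space without paired cross-media annotation. A common way to realize such weakly supervised alignment (a minimal sketch, not the paper's actual WASE implementation) is a max-margin triplet loss that pulls embeddings from the same document together and pushes in-batch mismatches apart; the function names below are illustrative:

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Project embeddings onto the unit sphere so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def weak_alignment_loss(text_emb, image_emb, margin=0.2):
    """Bidirectional max-margin loss over a batch of weakly paired documents.

    text_emb, image_emb: (B, D) arrays; row i of each comes from the same
    document (weak supervision), so the diagonal of the similarity matrix
    holds the positive pairs and off-diagonal entries are negatives.
    """
    t = l2_normalize(text_emb)
    v = l2_normalize(image_emb)
    sim = t @ v.T                      # (B, B) cosine similarity matrix
    pos = np.diag(sim)                 # similarity of weakly aligned pairs
    # Hinge: penalize any negative that comes within `margin` of the positive,
    # in both retrieval directions (text -> image and image -> text).
    cost_t2v = np.maximum(0.0, margin + sim - pos[:, None])
    cost_v2t = np.maximum(0.0, margin + sim - pos[None, :])
    np.fill_diagonal(cost_t2v, 0.0)    # positives incur no cost
    np.fill_diagonal(cost_v2t, 0.0)
    return (cost_t2v.sum() + cost_v2t.sum()) / len(pos)
```

With perfectly aligned, well-separated embeddings the loss is zero; random embeddings incur a positive penalty, which is the gradient signal that draws the two modalities into a shared space.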

Manling Li, Alireza Zareian, Qi Zeng, Spencer Whitehead, Di Lu, Heng Ji, Shih-Fu Chang • 2020

Related benchmarks

| Task | Dataset | Metric | Score | Rank |
| --- | --- | --- | --- | --- |
| Argument Role Extraction | M2E2 multimedia | F1 (%) | 19.9 | 15 |
| Event Mention Identification | M2E2 multimedia | F1 (%) | 50.8 | 15 |
| Argument Role Extraction | M2E2 image-only | Precision (%) | 14.5 | 14 |
| Event Mention Identification | M2E2 image-only | Precision (%) | 43.1 | 14 |
| Argument Role Extraction | M2E2 text-only | Precision (%) | 27.5 | 13 |
| Event Mention Identification | M2E2 text-only | Precision (%) | 42.8 | 13 |
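The benchmarks above mix F1 and precision. For reference, F1 is the harmonic mean of precision and recall; the values in this snippet are illustrative, not taken from the paper:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both in [0, 1])."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Illustrative: equal precision and recall give an F1 of the same value,
# while an imbalance pulls F1 toward the smaller of the two.
balanced = f1_score(0.5, 0.5)      # -> 0.5
skewed = f1_score(0.9, 0.1)        # -> 0.18
```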
