Multimodal Transformer for Unaligned Multimodal Language Sequences

About

Human language is often multimodal, which comprehends a mixture of natural language, facial gestures, and acoustic behaviors. However, two major challenges in modeling such multimodal human language time-series data exist: 1) inherent data non-alignment due to variable sampling rates for the sequences from each modality; and 2) long-range dependencies between elements across modalities. In this paper, we introduce the Multimodal Transformer (MulT) to generically address the above issues in an end-to-end manner without explicitly aligning the data. At the heart of our model is the directional pairwise crossmodal attention, which attends to interactions between multimodal sequences across distinct time steps and latently adapt streams from one modality to another. Comprehensive experiments on both aligned and non-aligned multimodal time-series show that our model outperforms state-of-the-art methods by a large margin. In addition, empirical analysis suggests that correlated crossmodal signals are able to be captured by the proposed crossmodal attention mechanism in MulT.

Yao-Hung Hubert Tsai, Shaojie Bai, Paul Pu Liang, J. Zico Kolter, Louis-Philippe Morency, Ruslan Salakhutdinov• 2019

Related benchmarks

Task	Dataset	Result
Multimodal Sentiment Analysis	CMU-MOSEI (test)	F1 Score82.3	401
Multimodal Sentiment Analysis	CMU-MOSI (test)	F183.9	385
Multimodal Sentiment Analysis	MOSEI	MAE0.559	183
Emotion Recognition in Conversation	IEMOCAP (test)	--	168
Multimodal Sentiment Analysis	CMU-MOSI	Accuracy (2-Class)81.7	166
Emotion Recognition	IEMOCAP	--	151
Multimodal Sentiment Analysis	MOSI	MAE0.871	132
Alzheimer stage classification	ADNI	AUC72.43	116
Multimodal Sentiment Analysis	CH-SIMS (test)	F1 Score79.66	108
Multimodal Emotion Recognition	IEMOCAP 6-way	F1 (Avg)58.1	106

Showing 10 of 110 rows

...

Other info

Code

Follow for update

@wizwand_team Discord