Multi-attention Recurrent Network for Human Communication Comprehension
About
Human face-to-face communication is a complex multimodal signal. We use words (language modality), gestures (vision modality), and changes in tone (acoustic modality) to convey our intentions. Humans process and understand face-to-face communication with ease; for Artificial Intelligence (AI), however, comprehending this form of communication remains a significant challenge. AI must understand each modality and the interactions between them that together shape human communication. In this paper, we present the Multi-attention Recurrent Network (MARN), a novel neural architecture for understanding human communication. The main strength of our model lies in discovering interactions between modalities through time using a neural component called the Multi-attention Block (MAB) and storing them in the hybrid memory of a recurrent component called the Long-short Term Hybrid Memory (LSTHM). We perform extensive comparisons on six publicly available datasets for multimodal sentiment analysis, speaker trait recognition, and emotion recognition. MARN achieves state-of-the-art performance on all of these datasets.
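To make the data flow concrete, below is a minimal PyTorch sketch of the two components. This is a simplified reading of the abstract, not the authors' released code: the class names, the single shared reduction network, and the reuse of `nn.LSTMCell` over the concatenated `[x_t, z_{t-1}]` input are our assumptions (the paper defines custom gate equations for the LSTHM and per-modality reduction networks in the MAB).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiAttentionBlock(nn.Module):
    """Sketch of the MAB: computes K attention distributions over the
    concatenated per-modality hidden states and summarizes the attended
    signal into a cross-modal code z_t."""

    def __init__(self, hidden_dim, num_modalities, num_attentions, code_dim):
        super().__init__()
        cat_dim = hidden_dim * num_modalities
        self.num_attentions = num_attentions
        # One attention-score vector per "attention" over the concatenated states.
        self.attn = nn.Linear(cat_dim, cat_dim * num_attentions)
        # Reduce the K attended copies into the cross-modal code (simplified:
        # the paper uses separate reduction networks per modality).
        self.reduce = nn.Linear(cat_dim * num_attentions, code_dim)

    def forward(self, h_cat):                        # h_cat: (batch, cat_dim)
        scores = self.attn(h_cat)                    # (batch, cat_dim * K)
        scores = scores.view(-1, self.num_attentions, h_cat.size(1))
        weights = F.softmax(scores, dim=-1)          # K attention distributions
        attended = weights * h_cat.unsqueeze(1)      # weighted copies of the states
        attended = attended.flatten(start_dim=1)     # (batch, cat_dim * K)
        return torch.tanh(self.reduce(attended))     # cross-modal code z_t


class LSTHMCell(nn.Module):
    """Sketch of the LSTHM: an LSTM cell whose input is augmented with the
    cross-modal code z from the previous step (the 'hybrid' memory)."""

    def __init__(self, input_dim, hidden_dim, code_dim):
        super().__init__()
        self.cell = nn.LSTMCell(input_dim + code_dim, hidden_dim)

    def forward(self, x_t, z_prev, state):
        # state is the usual (h, c) pair of an LSTM cell.
        return self.cell(torch.cat([x_t, z_prev], dim=-1), state)
```

At each time step, the model would run one `LSTHMCell` per modality, concatenate the resulting hidden states, pass them through the `MultiAttentionBlock` to obtain `z_t`, and feed `z_t` back into every cell at the next step, so that each modality's memory is conditioned on the discovered cross-modal interactions.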
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Emotion Recognition in Conversation | IEMOCAP (test) | -- | 154 |
| Multimodal Sentiment Analysis | CMU-MOSI | MAE: 0.968 | 59 |
| Emotion Classification | IEMOCAP (test) | -- | 36 |
| Sentiment Analysis | CMU-MOSI | Accuracy (2-class): 77.1 | 21 |
| Binary Sentiment Classification | CMU-MOSI (test) | A2: 77.1 | 17 |
| Multiclass Sentiment Classification | CMU-MOSI (test) | A7: 34.7 | 16 |
| Sentiment Analysis | ICT-MMMO (test) | A2: 86.3 | 15 |
| Sentiment Analysis | YouTube (test) | A3: 54.2 | 15 |
| Sentiment Analysis | MOUD (test) | A2: 81.1 | 15 |
| Speaker personality trait recognition | POM (test) | Confident (A7): 29.1 | 12 |