Supervised Multimodal Bitransformers for Classifying Images and Text

About

Self-supervised bidirectional transformer models such as BERT have led to dramatic improvements in a wide variety of textual classification tasks. The modern digital world is increasingly multimodal, however, and textual information is often accompanied by other modalities such as images. We introduce a supervised multimodal bitransformer model that fuses information from text and image encoders, and obtain state-of-the-art performance on various multimodal classification benchmark tasks, outperforming strong baselines, including on hard test sets specifically designed to measure multimodal performance.

Douwe Kiela, Suvrat Bhooshan, Hamed Firooz, Ethan Perez, Davide Testuggine • 2019

Related benchmarks

Task                                  Dataset                          Metric    Result  Rank
Multimodal Multilabel Classification  MM-IMDB (test)                   Macro F1  63.2    87
Hateful Meme Detection                Hateful Memes (test)             AUROC     0.7286  67
Multimodal Multiclass Classification  Food-101 (test)                  Accuracy  93.2    45
Hateful meme classification           HarM (test)                      AUC       85.48   31
Multi-class classification            HarMeme Harm-C corrected (test)  F1 Score  54.4    28
Multi-class classification            HarMeme Harm-P corrected (test)  F1 Score  47.1    28
Multimodal Classification             UPMC Food-101 (test)             Accuracy  94.1    28
Binary Classification                 HarMeme Harm-C corrected (test)  F1 Score  78      28
Binary Classification                 HarMeme Harm-P corrected (test)  F1 Score  64.9    28
Multimodal Classification             SNLI-VE (test)                   Accuracy  74.69   22
Showing 10 of 22 rows
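
The MM-IMDB row reports Macro F1, which averages per-label F1 scores with equal weight, so rare labels count as much as frequent ones. The sketch below shows one common way to compute it for multilabel predictions. The function name `macro_f1` and the toy genre labels are illustrative assumptions, not taken from the paper or its evaluation code, and the exact averaging used by each benchmark may differ.

```python
def macro_f1(true_sets, pred_sets):
    """Unweighted mean of per-label F1 for multilabel predictions.

    true_sets / pred_sets: one set of labels per example (hypothetical format).
    """
    labels = sorted(set().union(*true_sets, *pred_sets))
    if not labels:
        return 0.0

    scores = []
    for label in labels:
        tp = fp = fn = 0
        for truth, pred in zip(true_sets, pred_sets):
            if label in pred and label in truth:
                tp += 1          # predicted and correct
            elif label in pred:
                fp += 1          # predicted but not in the ground truth
            elif label in truth:
                fn += 1          # missed label
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the label never occurs
        scores.append(2 * tp / denom if denom else 0.0)

    return sum(scores) / len(scores)


# Toy example with made-up genre labels
truth = [{"drama"}, {"drama", "comedy"}, {"comedy"}]
preds = [{"drama"}, {"drama"}, {"comedy", "drama"}]
score = macro_f1(truth, preds)
print(round(score, 4))
```

Here "drama" scores 0.8 and "comedy" scores about 0.667, so the macro average is about 0.733. Single-label tasks fit the same function by passing one-element sets.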
