OPT: Omni-Perception Pre-Trainer for Cross-Modal Understanding and Generation
About
In this paper, we propose an Omni-perception Pre-Trainer (OPT) for cross-modal understanding and generation, which jointly models visual, text, and audio resources. OPT is built as an encoder-decoder framework comprising three single-modal encoders that generate token-based embeddings for each modality, a cross-modal encoder that encodes the correlations among the three modalities, and two cross-modal decoders that generate text and images, respectively. To pre-train OPT, we design a multi-task pretext learning scheme that models multi-modal resources at three data granularities, i.e., token-, modality-, and sample-level modeling, through which OPT learns to align and translate among different modalities. Pre-training is carried out on a large number of image-text-audio triplets from Open Images. Experimental results show that OPT can learn strong image-text-audio multi-modal representations and achieve promising results on a variety of cross-modal understanding and generation tasks.
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Multi-Label Classification | Open Images (val) | mAP | 58.11 | 9 |
| Audio Recognition | OpenImages-5K (test) | WER | 30.24 | 5 |
| Image-to-Text Retrieval | OpenImages-5K (test) | R@1 | 39.4 | 3 |
| Text-to-Image Retrieval | OpenImages-5K (test) | R@1 | 41.96 | 2 |
| Audio-to-Text Retrieval | OpenImages-5K (test) | R@1 | 80.3 | 1 |
| Text-Audio-to-Image Retrieval | OpenImages-5K (test) | R@1 | 57.06 | 1 |
| Text-to-Audio Retrieval | OpenImages-5K (test) | R@1 | 78 | 1 |
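
The retrieval rows report recall at rank 1 (R@1). As a rough illustration of how such a score is typically computed, the sketch below takes a query-by-candidate similarity matrix and counts how often the ground-truth candidate lands in the top k. It is not taken from the OPT codebase: the function name `recall_at_k`, the toy `scores` matrix, and the assumption that query `i` matches candidate `i` are all hypothetical choices made for this example.

```python
from typing import Sequence


def recall_at_k(similarity: Sequence[Sequence[float]], k: int = 1) -> float:
    """Percentage of queries whose ground-truth item ranks in the top k.

    Assumption: similarity[i][j] scores query i against candidate j, and
    the correct candidate for query i is candidate i.
    """
    if not similarity:
        raise ValueError("similarity matrix is empty")
    hits = 0
    for i, row in enumerate(similarity):
        # Candidate indices sorted by descending similarity score.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy scores for three queries and three candidates (made-up values).
scores = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.1, 0.2, 0.7],
]
print(recall_at_k(scores, k=1))
print(recall_at_k(scores, k=2))
```

With a real model, the matrix would hold similarities between, for example, text queries and image candidates from the test split; the one-to-one pairing assumed here would need adjusting if a query has several correct candidates.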