ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities

About

In this work, we explore a scalable way for building a general representation model toward unlimited modalities. We release ONE-PEACE, a highly extensible model with 4B parameters that can seamlessly align and integrate representations across vision, audio, and language modalities. The architecture of ONE-PEACE comprises modality adapters, shared self-attention layers, and modality FFNs. This design allows for the easy extension of new modalities by adding adapters and FFNs, while also enabling multi-modal fusion through self-attention layers. To pretrain ONE-PEACE, we develop two modality-agnostic pretraining tasks, cross-modal aligning contrast and intra-modal denoising contrast, which align the semantic space of different modalities and capture fine-grained details within modalities concurrently. With the scaling-friendly architecture and pretraining tasks, ONE-PEACE has the potential to expand to unlimited modalities. Without using any vision or language pretrained model for initialization, ONE-PEACE achieves leading results on a wide range of uni-modal and multi-modal tasks, including image classification (ImageNet), semantic segmentation (ADE20K), audio-text retrieval (AudioCaps, Clotho), audio classification (ESC-50, FSD50K, VGGSound), audio question answering (AVQA), image-text retrieval (MSCOCO, Flickr30K), and visual grounding (RefCOCO/+/g). Code is available at https://github.com/OFA-Sys/ONE-PEACE.
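The cross-modal aligning contrast mentioned above can be illustrated with a small sketch. The abstract does not give the loss formula, so this assumes a CLIP-style symmetric InfoNCE objective over paired embeddings from two modalities. The function names and the temperature value of 0.07 are hypothetical choices for illustration and are not taken from the ONE-PEACE code base.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy_row(logits, target):
    """Cross-entropy of one row of logits against a target index (stable log-sum-exp)."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[target]


def contrastive_loss(emb_a, emb_b, temperature=0.07):
    """Symmetric InfoNCE loss (an assumed formulation, not the paper's exact loss).

    emb_a[i] and emb_b[i] are embeddings of the same sample in two modalities,
    for example an image and its caption. All other pairs act as negatives.
    """
    a = [l2_normalize(v) for v in emb_a]
    b = [l2_normalize(v) for v in emb_b]
    n = len(a)
    # sim[i][j]: temperature-scaled cosine similarity between a[i] and b[j].
    sim = [
        [sum(x * y for x, y in zip(a[i], b[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Direction a -> b: each row should peak on its diagonal entry.
    loss_ab = sum(cross_entropy_row(sim[i], i) for i in range(n)) / n
    # Direction b -> a: each column should peak on its diagonal entry.
    loss_ba = sum(
        cross_entropy_row([sim[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return 0.5 * (loss_ab + loss_ba)
```

Under these assumptions, a batch whose paired embeddings point in the same direction gives a loss near zero, while a batch whose pairs are swapped gives a large loss. That is the pressure that pulls matching samples from different modalities together in a shared space.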

Peng Wang, Shijie Wang, Junyang Lin, Shuai Bai, Xiaohuan Zhou, Jingren Zhou, Xinggang Wang, Chang Zhou • 2023

Related benchmarks

Task | Dataset | Result | Rank
-----|---------|--------|-----
Semantic segmentation | ADE20K (val) | -- | 2731
Image Classification | ImageNet-1k (val) | Top-1 Accuracy 89.8 | 1453
Visual Question Answering | VQA v2 (test-dev) | Overall Accuracy 82.6 | 664
Visual Question Answering | VQA v2 (test-std) | Accuracy 82.5 | 466
Image-to-Text Retrieval | Flickr30K 1K (test) | R@1 97.6 | 439
Text-to-Image Retrieval | Flickr30k (test) | Recall@1 73.4 | 423
Text-to-Image Retrieval | Flickr30K 1K (test) | R@1 89.6 | 375
Referring Expression Comprehension | RefCOCO+ (val) | Accuracy 88.8 | 345
Referring Expression Comprehension | RefCOCO (val) | Accuracy 92.6 | 335
Referring Expression Comprehension | RefCOCO (testA) | Accuracy 0.942 | 333
Showing 10 of 66 rows
