TencentPretrain: A Scalable and Flexible Toolkit for Pre-training Models of Different Modalities
About
Recently, the success of pre-training in the text domain has been fully extended to vision, audio, and cross-modal scenarios. Pre-training models of different modalities show a rising trend of homogeneity in their model structures, which opens the opportunity to implement them within a uniform framework. In this paper, we present TencentPretrain, a toolkit that supports pre-training models of different modalities. The core feature of TencentPretrain is its modular design. The toolkit uniformly divides pre-training models into five components: embedding, encoder, target embedding, decoder, and target. Since almost all common modules are provided for each component, users can choose the desired modules from different components to build a complete pre-training model. The modular design enables users to efficiently reproduce existing pre-training models or build brand-new ones. We test the toolkit on text, vision, and audio benchmarks and show that it matches the performance of the original implementations.
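The five-component design described above can be illustrated with a minimal sketch. All class and method names below are hypothetical stand-ins, not TencentPretrain's actual API (which assembles modules from configuration files); the point is only the composition pattern: each slot accepts any interchangeable module, and swapping modules yields a different pre-training model without changing the pipeline.

```python
# Hypothetical sketch of the five-slot modular design: embedding, encoder,
# target embedding, decoder, and target. Not TencentPretrain's real API.

class WordEmbedding:
    def __call__(self, tokens):
        # Toy stand-in: map token ids to scalar "vectors".
        return [float(t) for t in tokens]

class TransformerEncoder:
    def __call__(self, vectors):
        # Toy stand-in for contextual encoding: running sums.
        out, acc = [], 0.0
        for v in vectors:
            acc += v
            out.append(acc)
        return out

class MlmTarget:
    def __call__(self, hidden):
        # Toy stand-in for a masked-LM loss: mean of hidden states.
        return sum(hidden) / len(hidden)

class PretrainModel:
    """Five slots. Target embedding and decoder are optional, so
    encoder-only models (BERT-like) simply leave them unset."""

    def __init__(self, embedding, encoder, target,
                 tgt_embedding=None, decoder=None):
        self.embedding = embedding
        self.encoder = encoder
        self.tgt_embedding = tgt_embedding
        self.decoder = decoder
        self.target = target

    def forward(self, src):
        hidden = self.encoder(self.embedding(src))
        if self.decoder is not None:
            hidden = self.decoder(self.tgt_embedding(hidden))
        return self.target(hidden)

# Swapping any slot (e.g. a speech embedding plus the same encoder and
# target) would give a different pre-training model.
bert_like = PretrainModel(WordEmbedding(), TransformerEncoder(), MlmTarget())
loss = bert_like.forward([1, 2, 3])
```

A vision or audio model would reuse `PretrainModel` unchanged and only replace the embedding (e.g. patch or waveform features) and the target, which is the homogeneity the abstract refers to.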
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Automatic Speech Recognition | LibriSpeech (test-other) | WER | 9 | 966 |
| Automatic Speech Recognition | LibriSpeech clean (test) | WER | 4.1 | 833 |
| Image Classification | CIFAR10 (test) | Accuracy | 98.95 | 585 |
| Natural Language Understanding | GLUE | SST-2 | 96.4 | 452 |
| Automatic Speech Recognition | LibriSpeech (dev-other) | WER | 8.9 | 411 |
| Automatic Speech Recognition | LibriSpeech (dev-clean) | WER (%) | 3.8 | 319 |
| Image Classification | CIFAR100 (test) | Top-1 Accuracy | 92.12 | 2 |