# Vision-Language Pre-Training for Boosting Scene Text Detectors

## About
Recently, vision-language joint representation learning has proven highly effective in various scenarios. In this paper, we specifically adapt vision-language joint learning to scene text detection, a task that intrinsically involves cross-modal interaction between vision and language, since text is the written form of language. Concretely, we propose to learn contextualized joint representations through vision-language pre-training in order to enhance the performance of scene text detectors. Towards this end, we devise a pre-training architecture with an image encoder, a text encoder, and a cross-modal encoder, as well as three pretext tasks: image-text contrastive learning (ITC), masked language modeling (MLM), and word-in-image prediction (WIP). The pre-trained model produces more informative representations with richer semantics, which readily benefit existing scene text detectors (such as EAST and PSENet) in the downstream text detection task. Extensive experiments on standard benchmarks demonstrate that the proposed paradigm significantly improves the performance of various representative text detectors, outperforming previous pre-training approaches. The code and pre-trained models will be publicly released.
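To make the ITC pretext task concrete, the following is a minimal numpy sketch of a symmetric image-text contrastive (InfoNCE-style) loss, where matched image/text embeddings share a row index and all other rows in the batch act as negatives. The function name, temperature value, and embedding shapes are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def itc_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric image-text contrastive loss (illustrative sketch).

    image_emb, text_emb: (B, D) arrays; row i of each is a matched pair.
    All temperature/shape choices here are assumptions for illustration.
    """
    # L2-normalize so the dot product becomes cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    logits = image_emb @ text_emb.T / temperature  # (B, B) similarity matrix
    labels = np.arange(len(logits))                # diagonal = positive pairs

    def cross_entropy(l, y):
        l = l - l.max(axis=1, keepdims=True)       # numerical stability
        log_softmax = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_softmax[np.arange(len(y)), y].mean()

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))
```

Perfectly aligned embeddings drive the loss toward zero, while mismatched pairs are penalized; the MLM and WIP objectives would be added on top of the cross-modal encoder's outputs.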
## Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Text Detection | ICDAR 2015 | Precision | 91.5 | 171 |
| Scene Text Detection | ICDAR 2015 (test) | F1 Score | 86.5 | 150 |
| Text Detection | Total-Text | Recall | 82 | 139 |
| Scene Text Detection | TotalText (test) | Recall | 84 | 106 |
| Text Detection | MSRA-TD500 | Precision | 88.5 | 84 |
| Text Detection | CTW1500 | F-measure | 83.3 | 70 |
| Scene Text Detection | MSRA-TD500 (test) | Precision | 92.3 | 65 |
| Scene Text Detection | ICDAR 2017 | Precision | 77.7 | 5 |