Towards Weakly-Supervised Text Spotting using a Multi-Task Transformer
About
End-to-end text spotting methods have recently gained attention in the literature due to the benefits of jointly optimizing the text detection and recognition components. Existing methods usually keep a strict separation between the detection and recognition branches, requiring exact annotations for both tasks. We introduce TextTranSpotter (TTS), a transformer-based approach for text spotting and the first text spotting framework that can be trained in both fully- and weakly-supervised settings. By learning a single latent representation per word detection, and using a novel loss function based on the Hungarian loss, our method alleviates the need for expensive localization annotations. Trained with only text transcription annotations on real data, our weakly-supervised method achieves performance competitive with previous state-of-the-art fully-supervised methods. When trained in a fully-supervised manner, TextTranSpotter achieves state-of-the-art results on multiple benchmarks.
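The Hungarian-style loss mentioned above relies on bipartite matching: each ground-truth word is paired with exactly one predicted word instance so that the total matching cost is minimal, and the loss is then computed over the matched pairs. A minimal illustrative sketch follows; the cost values are made up, and the brute-force search stands in for an efficient solver such as `scipy.optimize.linear_sum_assignment` (this is not the paper's exact formulation):

```python
from itertools import permutations

def hungarian_match(cost):
    """Brute-force optimal one-to-one assignment between predictions (rows)
    and ground-truth words (columns), minimizing total cost.

    Practical systems use scipy.optimize.linear_sum_assignment; this
    exhaustive version is only for illustration on tiny inputs."""
    n_pred, n_gt = len(cost), len(cost[0])
    best, best_cost = None, float("inf")
    # Try every way of assigning each ground-truth word a distinct prediction.
    for perm in permutations(range(n_pred), n_gt):
        total = sum(cost[p][g] for g, p in enumerate(perm))
        if total < best_cost:
            best, best_cost = [(p, g) for g, p in enumerate(perm)], total
    return best

# Illustrative cost matrix: 3 predicted word queries vs. 2 ground-truth words.
# Each entry stands in for a combined detection/transcription matching cost
# (the numbers are assumptions for demonstration only).
cost = [
    [0.2, 0.9],
    [0.8, 0.1],
    [0.5, 0.6],
]
print(hungarian_match(cost))  # -> [(0, 0), (1, 1)]
```

In a weakly-supervised setting like the one described above, the matching cost can be built from annotation-free terms (e.g. transcription agreement) rather than box overlap, which is what lets localization labels be dropped.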
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Text Detection | ICDAR 2015 (test) | F1 Score | 85.2 | 108 |
| Scene Text Spotting | Total-Text (test) | F-measure (None) | 78.2 | 105 |
| End-to-End Text Spotting | ICDAR 2015 | Strong Score | 85.2 | 80 |
| End-to-End Text Spotting | ICDAR 2015 (test) | Generic F-measure | 77.4 | 62 |
| End-to-End Scene Text Spotting | Total-Text | Hmean (None) | 78.2 | 55 |
| Word Spotting | ICDAR 2015 | Strong Score | 85 | 42 |
| Word Spotting | ICDAR 2015 (test) | F-score (Strong lexicon) | 85 | 36 |
| Text Spotting | ICDAR 2015 (test) | Accuracy (Strong Lexicon) | 81.7 | 36 |
| Scene Text Spotting | Total-Text | F-measure (None) | 78.2 | 23 |
| End-to-end Recognition | Total-Text | -- | -- | 22 |