ESTextSpotter: Towards Better Scene Text Spotting with Explicit Synergy in Transformer

About

In recent years, end-to-end scene text spotting approaches have been evolving toward Transformer-based frameworks. While previous studies have shown the crucial importance of the intrinsic synergy between text detection and recognition, recent Transformer-based methods usually adopt an implicit synergy strategy with a shared query, which cannot fully realize the potential of these two interactive tasks. In this paper, we argue that explicit synergy that considers the distinct characteristics of text detection and recognition can significantly improve the performance of text spotting. To this end, we introduce a new model named Explicit Synergy-based Text Spotting Transformer framework (ESTextSpotter), which achieves explicit synergy by modeling discriminative and interactive features for text detection and recognition within a single decoder. Specifically, we decompose the conventional shared query into task-aware queries for text polygon and content, respectively. Through the decoder with the proposed vision-language communication module, the queries interact with each other in an explicit manner while preserving discriminative patterns of text detection and recognition, thus improving performance significantly. Additionally, we propose a task-aware query initialization scheme to ensure stable training. Experimental results demonstrate that our model significantly outperforms previous state-of-the-art methods. Code is available at https://github.com/mxin262/ESTextSpotter.

Mingxin Huang, Jiaxin Zhang, Dezhi Peng, Hao Lu, Can Huang, Yuliang Liu, Xiang Bai, Lianwen Jin • 2023

Related benchmarks

Task	Dataset	Metric	Result
Text Detection	Total-Text (test)	F-Measure	90	126
Text Detection	ICDAR 2015 (test)	F1 Score	91	108
Scene Text Detection	TotalText (test)	Recall	88.1	106
Scene Text Spotting	Total-Text (test)	F-measure (None)	80.8	105
Text Detection	CTW1500	F-measure	90	98
Scene Text Detection	Total-Text	Precision	92	79
End-to-End Text Spotting	ICDAR 2015 (test)	Generic F-measure	78.1	62
Text Spotting	ICDAR 2015 (test)	Accuracy (Strong Lexicon)	87.5	36
End-to-End Text Spotting	SCUT-CTW1500 (test)	F-Measure (None Config)	66	34
Scene Text Detection	ICDAR 2015	Precision	92.5	25

Showing 10 of 11 rows
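
The Total-Text detection rows can be cross-checked against each other. The sketch below assumes the reported F-measure is the usual harmonic mean of precision and recall (the page does not state the formula) and that the precision, recall, and F-measure rows come from the same evaluation run. The function name `f_measure` is illustrative, not taken from the ESTextSpotter code.

```python
def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Total-Text detection values listed in the table: precision 92, recall 88.1.
total_text_f = f_measure(92.0, 88.1)
print(round(total_text_f, 1))  # prints 90.0
```

Under these assumptions the result agrees with the listed Total-Text F-Measure of 90.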
