Accurate Scene Text Recognition with Efficient Model Scaling and Cloze Self-Distillation

About

Scaling architectures have been proven effective for improving Scene Text Recognition (STR), but the individual contribution of vision encoder and text decoder scaling remain under-explored. In this work, we present an in-depth empirical analysis and demonstrate that, contrary to previous observations, scaling the decoder yields significant performance gains, always exceeding those achieved by encoder scaling alone. We also identify label noise as a key challenge in STR, particularly in real-world data, which can limit the effectiveness of STR models. To address this, we propose Cloze Self-Distillation (CSD), a method that mitigates label noise by distilling a student model from context-aware soft predictions and pseudolabels generated by a teacher model. Additionally, we enhance the decoder architecture by introducing differential cross-attention for STR. Our methodology achieves state-of-the-art performance on 10 out of 11 benchmarks using only real data, while significantly reducing the parameter size and computational costs.

Andrea Maracani, Savas Ozkan, Sijun Cho, Hyowon Kim, Eunchung Noh, Jeongwon Min, Cho Jung Min, Dookun Park, Mete Ozay• 2025

Related benchmarks

Task	Dataset	Result
Scene Text Recognition	SVTP (test)	Word Accuracy98.3	153
Scene Text Recognition	IC15	Accuracy92.7	98
Scene Text Recognition	CUTE80	Accuracy99.7	59
Scene Text Recognition	Uber-Text (test)	Word Accuracy93.2	35
Scene Text Recognition	SVT 647 images	Accuracy99.2	33
Scene Text Recognition	Union14M (test)	Curve Accuracy97	22
Scene Text Recognition	IC15 (2077 samples)	Word Accuracy92.2	16
Scene Text Recognition	COCO 9825 samples	Word Accuracy83.4	16
Scene Text Recognition	IIIT5K 3000 samples	Word Accuracy99.5	16
Scene Text Recognition	ArT 34k samples	Word Accuracy86.4	16

Showing 10 of 14 rows

Other info

Follow for update

@wizwand_team Discord