
Accurate Scene Text Recognition with Efficient Model Scaling and Cloze Self-Distillation

About

Scaling architectures have been proven effective for improving Scene Text Recognition (STR), but the individual contributions of vision encoder and text decoder scaling remain under-explored. In this work, we present an in-depth empirical analysis and demonstrate that, contrary to previous observations, scaling the decoder yields significant performance gains, always exceeding those achieved by encoder scaling alone. We also identify label noise as a key challenge in STR, particularly in real-world data, which can limit the effectiveness of STR models. To address this, we propose Cloze Self-Distillation (CSD), a method that mitigates label noise by distilling a student model from context-aware soft predictions and pseudolabels generated by a teacher model. Additionally, we enhance the decoder architecture by introducing differential cross-attention for STR. Our methodology achieves state-of-the-art performance on 10 out of 11 benchmarks using only real data, while significantly reducing parameter count and computational cost.
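The distillation step described above can be sketched as a per-character loss that pulls the student toward both the teacher's context-aware soft distribution and its hard pseudolabel. This is a minimal illustrative sketch, not the paper's implementation: the balancing weight `alpha` and the exact loss combination are assumptions.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def csd_loss(student_logits, teacher_logits, alpha=0.5):
    """Illustrative per-character loss in the spirit of Cloze Self-Distillation:
    a KL term matches the teacher's soft predictions, and a cross-entropy term
    matches the teacher's hard pseudolabel (argmax). `alpha` is a hypothetical
    balancing weight, not a value from the paper."""
    p_t = softmax(teacher_logits)
    p_s = softmax(student_logits)
    # Soft distillation: KL(teacher || student)
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_t, p_s) if pt > 0)
    # Hard pseudolabel from the teacher's most likely character class
    pseudo = max(range(len(p_t)), key=lambda i: p_t[i])
    ce = -math.log(p_s[pseudo])
    return alpha * kl + (1 - alpha) * ce
```

In practice the loss would be averaged over all character positions of a word image, with the teacher's soft predictions conditioned on bidirectional (cloze-style) context.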

Andrea Maracani, Savas Ozkan, Sijun Cho, Hyowon Kim, Eunchung Noh, Jeongwon Min, Cho Jung Min, Dookun Park, Mete Ozay • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Scene Text Recognition | SVTP (test) | Word Accuracy 98.3 | 153 |
| Scene Text Recognition | IC15 | Accuracy 92.7 | 86 |
| Scene Text Recognition | CUTE80 | Accuracy 99.7 | 47 |
| Scene Text Recognition | Uber-Text (test) | Word Accuracy 93.2 | 35 |
| Scene Text Recognition | SVT (647 images) | Accuracy 99.2 | 33 |
| Scene Text Recognition | IC15 (2077 samples) | Word Accuracy 92.2 | 16 |
| Scene Text Recognition | COCO (9825 samples) | Word Accuracy 83.4 | 16 |
| Scene Text Recognition | IIIT5K (3000 samples) | Word Accuracy 99.5 | 16 |
| Scene Text Recognition | ArT (34k samples) | Word Accuracy 86.4 | 16 |
| Scene Text Recognition | HOST | Word Accuracy 84.3 | 14 |

Showing 10 of 14 rows.
