Spatial Aggregation for Scene Text Recognition

Abstract: Text recognition in natural images is an important research topic that has attracted widespread interest in recent years. In the absence of character-level annotations, most existing state-of-the-art scene text recognition methods use CTC or attention-based decoders in the prediction stage to obtain word-level recognition results. However, these decoders rely heavily on the training vocabulary and fail to produce satisfactory results when the sample to be predicted is outside that vocabulary. Moreover, predicting characters serially, one at a time, limits efficiency. To address these problems, this paper proposes a new structure for the prediction stage that alleviates vocabulary reliance and accelerates prediction. In the new prediction stage, two classification layers are applied to each feature vector to predict, from the image region that the feature vector represents, i) the character and ii) the order of that character in the word. A spatial aggregation layer then integrates the character classification and order estimation results to produce the recognized text. In addition, a self-attention layer is inserted between the feature extraction stage and the prediction stage to model context. Experimental results on various benchmarks demonstrate that, compared with several state-of-the-art approaches, the proposed model achieves better recognition accuracy and efficiency.
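
To make the prediction stage concrete, the following is a minimal Python sketch of how character and order predictions might be combined. The abstract does not specify the aggregation rule, so everything here is an assumption for illustration: the function name spatial_aggregate, the input layout, the order-weighted vote, and the min_mass stopping threshold are all hypothetical and are not taken from the paper.

```python
def spatial_aggregate(char_probs, order_probs, alphabet, min_mass=0.5):
    """Combine per-feature-vector character and order predictions into a word.

    char_probs[n][c]:  probability that feature vector n shows alphabet[c].
    order_probs[n][t]: probability that feature vector n holds the t-th character.
    The weighting scheme and the min_mass cutoff are illustrative assumptions.
    """
    num_vectors = len(char_probs)
    num_positions = len(order_probs[0])
    word = []
    for t in range(num_positions):
        # Total order probability that all feature vectors assign to position t.
        mass = sum(row[t] for row in order_probs)
        if mass < min_mass:
            break  # assumed stopping rule: no image region claims this position
        # Weighted vote: each feature vector contributes its character
        # distribution, scaled by how strongly it claims position t.
        scores = [
            sum(order_probs[n][t] * char_probs[n][c] for n in range(num_vectors))
            for c in range(len(alphabet))
        ]
        best = max(range(len(alphabet)), key=scores.__getitem__)
        word.append(alphabet[best])
    return "".join(word)


# Toy input: four feature vectors, a three-letter alphabet, four order slots.
alphabet = "abc"
char_probs = [
    [0.1, 0.1, 0.8],  # region 0 looks like "c"
    [0.7, 0.2, 0.1],  # region 1 looks like "a"
    [0.6, 0.3, 0.1],  # region 2 also looks like "a"
    [0.1, 0.8, 0.1],  # region 3 looks like "b"
]
order_probs = [
    [0.9, 0.1, 0.0, 0.0],  # region 0 is most likely the 1st character
    [0.1, 0.8, 0.1, 0.0],  # regions 1 and 2 both cover the 2nd character
    [0.0, 0.7, 0.3, 0.0],
    [0.0, 0.1, 0.9, 0.0],  # region 3 is most likely the 3rd character
]
print(spatial_aggregate(char_probs, order_probs, alphabet))  # cab
```

In this sketch the character chosen for each position depends only on the per-region predictions, never on a previously decoded character, so all positions could be computed in parallel. That property is what distinguishes this style of prediction from the serial, character-by-character decoding the abstract criticizes.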