Abstract:
Scene text in natural images varies widely in appearance and often suffers from degradations, which makes reliable recognition challenging. In this paper, we propose a novel scene text recognition method that places an effective, end-to-end trainable text image enhancement network before an attention-based recognition network; the enhancement network adaptively improves the text image and thereby the performance of the whole recognition model. Specifically, the enhancement network combines a novel hierarchical residual enhancement network, which generates and refines pixel-wise enhancement details that are added to the input text image, with a spatial rectification network that regularizes the shape of the text. Trained end-to-end with the recognition network under weak supervision, using word annotations only, the enhancement network effectively learns to transform the text image into a form more favorable for subsequent recognition. State-of-the-art results on several standard benchmarks demonstrate the effectiveness of our enhancement-based scene text recognition method.
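
The additive, coarse-to-fine idea behind the residual enhancement can be illustrated with a small toy sketch. It is not the proposed network: the learned convolutional branches are replaced by a hypothetical hand-written contrast residual (`toy_residual`), the scale factors and gain are arbitrary assumptions, and the spatial rectification and attention-based recognition stages are omitted. It only shows how residual details computed at several scales can be added to the input image.

```python
from typing import List, Sequence

Image = List[List[float]]  # grayscale image with pixel values in [0, 1]


def downsample(img: Image, factor: int) -> Image:
    """Average-pool by an integer factor (height and width must be divisible by it)."""
    h, w = len(img), len(img[0])
    return [
        [
            sum(img[y * factor + dy][x * factor + dx]
                for dy in range(factor) for dx in range(factor)) / factor ** 2
            for x in range(w // factor)
        ]
        for y in range(h // factor)
    ]


def upsample(img: Image, factor: int) -> Image:
    """Nearest-neighbour upsampling by an integer factor."""
    return [
        [img[y // factor][x // factor] for x in range(len(img[0]) * factor)]
        for y in range(len(img) * factor)
    ]


def toy_residual(img: Image, gain: float) -> Image:
    """Hypothetical stand-in for a learned residual branch.

    Pushes each pixel away from the image mean, i.e. a simple contrast boost.
    """
    mean = sum(map(sum, img)) / (len(img) * len(img[0]))
    return [[gain * (p - mean) for p in row] for row in img]


def enhance(img: Image, factors: Sequence[int] = (4, 2, 1), gain: float = 0.5) -> Image:
    """Coarse-to-fine enhancement: add a residual computed at each scale to the image."""
    out = img
    for f in factors:
        coarse = downsample(out, f)
        residual = upsample(toy_residual(coarse, gain), f)
        # Pixel-wise addition of the residual, clipped to the valid range.
        out = [
            [min(1.0, max(0.0, p + r)) for p, r in zip(prow, rrow)]
            for prow, rrow in zip(out, residual)
        ]
    return out


if __name__ == "__main__":
    # Low-contrast 4x4 image: dark left half, slightly brighter right half.
    image = [[0.4, 0.4, 0.6, 0.6] for _ in range(4)]
    for row in enhance(image):
        print([round(p, 3) for p in row])
```

In the method described above, the residuals would instead be produced by network layers whose parameters are learned from the recognition loss alone, so the enhancement is driven by what helps recognition rather than by a fixed rule such as this contrast boost.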