Abstract:
A practical large-scale product recognition system suffers from long-tailed, imbalanced training data in the e-commerce setting at Alibaba. In addition to product images, a large amount of related side information (e.g., titles and tags) reveals rich semantic information about those images. Prior works mainly address the long-tail problem from the visual perspective only and do not consider leveraging the side information. In this paper, we present a novel side-information-based large-scale visual recognition co-training (SICoT) system that deals with the long-tail problem by leveraging image-related side information. In the proposed co-training system, we first introduce a bilinear word attention module that constructs a semantic embedding from the noisy side information. A visual feature and semantic embedding co-training scheme is then designed to transfer knowledge from classes with abundant training data (head classes) to classes with few training data (tail classes) in an end-to-end fashion. Extensive experiments on four challenging large-scale datasets, whose numbers of classes range from one thousand to one million, demonstrate the scalable effectiveness of the proposed SICoT system in alleviating the long-tail problem.
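As a rough illustration of what a bilinear word attention step could look like, the sketch below scores each word vector against a visual feature through a bilinear form, normalizes the scores with a softmax, and returns the weighted sum of word vectors as a semantic embedding. This is a minimal sketch under stated assumptions, not the SICoT implementation: the function name `bilinear_word_attention`, the matrix `W`, the softmax normalization, and the use of the visual feature as the query are all hypothetical choices made for illustration.

```python
import math


def bilinear_word_attention(visual, words, W):
    """Hypothetical bilinear attention over word vectors.

    visual: list of floats, length d_v (image feature; assumed query).
    words:  list of word vectors, each a list of floats of length d_w.
    W:      d_v x d_w matrix (list of rows); assumed learnable in practice.
    Returns (attention weights, semantic embedding of length d_w).
    """
    d_w = len(words[0])
    # Project the visual feature through W: q = visual^T W, length d_w.
    q = [sum(visual[i] * W[i][j] for i in range(len(visual))) for j in range(d_w)]
    # Bilinear score for each word: visual^T W word.
    scores = [sum(qj * wj for qj, wj in zip(q, w)) for w in words]
    # Numerically stable softmax over the words.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Semantic embedding: attention-weighted sum of the word vectors.
    embedding = [sum(a * w[j] for a, w in zip(weights, words)) for j in range(d_w)]
    return weights, embedding


# Toy usage: the first word aligns with the visual feature, the second does not.
visual = [1.0, 0.0]
W = [[1.0, 0.0], [0.0, 1.0]]
words = [[5.0, 0.0], [0.0, 5.0]]
weights, embedding = bilinear_word_attention(visual, words, W)
```

In this toy setup, a word that is unrelated to the image receives a small weight, which is one plausible way noisy titles or tags could be down-weighted before the embedding is used for co-training.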