Abstract:
The task of combining complementary pieces of clothing into an outfit is familiar to most people, but has so far proved difficult to automate. We present a model that uses multimodal embeddings of clothing items based on images and textual descriptions. The embeddings and a shared style space are trained end to end in a novel deep neural network architecture. The network is trained on the largest and richest labelled outfit dataset made available to date, which we open-source. It is the first publicly released expert-created, labelled dataset of its kind and contains 586,320 labelled outfits. We evaluate our model in an A/B test against a template-based baseline that selects items from the correct classes but ignores style. Our experiments show that our model outperforms the baseline by 21% for womenswear and 34% for menswear.
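To make the notion of a shared style space concrete, the following is a minimal sketch: image and text features of an item are projected into a common space, and outfit compatibility is scored by similarity there. All dimensions, the random projections, and the function names are illustrative assumptions, not the paper's trained architecture (where these projections would be learned end to end).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature sizes; the paper's actual dimensions may differ.
IMG_DIM, TXT_DIM, STYLE_DIM = 512, 300, 64

# Random projection matrices standing in for weights that the real
# model would learn jointly with the rest of the network.
W_img = rng.standard_normal((IMG_DIM, STYLE_DIM)) / np.sqrt(IMG_DIM)
W_txt = rng.standard_normal((TXT_DIM, STYLE_DIM)) / np.sqrt(TXT_DIM)

def embed_item(img_feat, txt_feat):
    """Fuse image and text features into one style-space vector."""
    z = img_feat @ W_img + txt_feat @ W_txt
    return z / np.linalg.norm(z)  # unit norm, so dot product = cosine

def style_compatibility(item_a, item_b):
    """Cosine similarity of two items in the shared style space."""
    return float(embed_item(*item_a) @ embed_item(*item_b))

# Two mock items, each a (image features, text features) pair.
shirt = (rng.standard_normal(IMG_DIM), rng.standard_normal(TXT_DIM))
trousers = (rng.standard_normal(IMG_DIM), rng.standard_normal(TXT_DIM))
score = style_compatibility(shirt, trousers)
```

In the full model, a high score would indicate that two items belong together stylistically; a template-based baseline has no analogue of this score and matches on class alone.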