Abstract:
We propose an efficient algorithm for training a very deep and thin network with a theoretical guarantee. Our method is motivated by model compression and consists of three stages. In the first stage, we widen the deep thin network and train it until convergence. In the second stage, we use this well-trained deep wide network to warm up (initialize) the original deep thin network. In the last stage, we train this well-initialized deep thin network until convergence. The key ingredient of our method is its second stage, in which the thin network is gradually warmed up by imitating the intermediate outputs of the wide network from bottom to top. We establish a theoretical guarantee using mean field analysis, showing that our method is provably more efficient than directly training a deep thin network from scratch. We also conduct empirical evaluations on image classification and language modeling. Trained with our approach, ResNet50 can outperform ResNet101 trained in the standard way, and BERT_BASE can be comparable to BERT_LARGE.
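The three stages can be illustrated with a minimal sketch on small MLP blocks and synthetic data. All module and function names below (e.g. make_network, forward_blocks, the per-block projections) are illustrative placeholders rather than the paper's implementation; in particular, the projection layers that match the thin and wide hidden widths during warm-up are an assumed detail.

```python
# A minimal sketch of the three-stage procedure, not the authors' code.
import torch
import torch.nn as nn

torch.manual_seed(0)

def make_network(width, depth, in_dim=16, out_dim=4):
    """Stack of `depth` blocks followed by a linear classifier head."""
    blocks, dim = nn.ModuleList(), in_dim
    for _ in range(depth):
        blocks.append(nn.Sequential(nn.Linear(dim, width), nn.ReLU()))
        dim = width
    return blocks, nn.Linear(dim, out_dim)

def forward_blocks(blocks, head, x, upto=None):
    """Run the first `upto` blocks (hidden output), or the full network if None."""
    for block in (blocks if upto is None else blocks[:upto]):
        x = block(x)
    return x if upto is not None else head(x)

# Synthetic classification data.
X, y = torch.randn(256, 16), torch.randint(0, 4, (256,))
loss_fn = nn.CrossEntropyLoss()

def train(params, closure, steps=200):
    opt = torch.optim.Adam(params, lr=1e-2)
    for _ in range(steps):
        opt.zero_grad()
        closure().backward()
        opt.step()

# Stage 1: widen the network and train it to convergence on the task.
wide_blocks, wide_head = make_network(width=64, depth=4)
train(list(wide_blocks.parameters()) + list(wide_head.parameters()),
      lambda: loss_fn(forward_blocks(wide_blocks, wide_head, X), y))

# Stage 2: warm up the thin network bottom-to-top by imitating the wide
# network's intermediate outputs; a per-block projection (assumed here)
# maps the thin hidden width to the wide hidden width for comparison.
thin_blocks, thin_head = make_network(width=16, depth=4)
projections = nn.ModuleList(nn.Linear(16, 64) for _ in range(4))
for k in range(1, 5):  # grow the imitated prefix one block at a time
    with torch.no_grad():
        target = forward_blocks(wide_blocks, wide_head, X, upto=k)
    params = [p for b in thin_blocks[:k] for p in b.parameters()]
    params += list(projections[k - 1].parameters())
    train(params,
          lambda k=k, t=target: nn.functional.mse_loss(
              projections[k - 1](forward_blocks(thin_blocks, thin_head, X, upto=k)),
              t))

# Stage 3: train the warmed-up thin network on the task until convergence.
train(list(thin_blocks.parameters()) + list(thin_head.parameters()),
      lambda: loss_fn(forward_blocks(thin_blocks, thin_head, X), y))
```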