Abstract:
Recommendation systems operate in a highly stochastic and non-stationary environment. The amount of available user-specific information varies, and users' interests themselves change over time. This combination creates a dynamic setting in which a single solution will rarely remain optimal unless it can keep up with these changes. Moreover, one system may perform better than another depending on the situation at hand, which makes the choice of which system to deploy even more difficult. We address these problems using the Hierarchical Reinforcement Learning framework. Our proposed meta-bandit acts as a policy over options, where each option maps to a pre-trained, independent recommender system. The meta-bandit learns online and selects a recommender according to the context, adjusting as conditions change. We conducted experiments on real data and found that our approach adapts to the dynamics of users' changing interests. We also show that it outperforms each of the recommenders individually, as well as an ensemble of them.
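To make the selection mechanism concrete, the sketch below shows one way a contextual meta-bandit over pre-trained recommenders could be wired up. The abstract does not name the bandit algorithm, so this uses standard LinUCB purely as a stand-in; `context_features`, `recommend`, and `observe_feedback` are hypothetical interfaces introduced for illustration, not the paper's actual API.

```python
import numpy as np

class LinUCBMetaBandit:
    """Minimal contextual meta-bandit: each arm (option) is a
    pre-trained recommender; LinUCB picks which one to invoke.
    A sketch under stated assumptions, not the paper's exact method."""

    def __init__(self, n_arms, dim, alpha=1.0):
        self.alpha = alpha                               # exploration strength
        self.A = [np.eye(dim) for _ in range(n_arms)]    # per-arm design matrices
        self.b = [np.zeros(dim) for _ in range(n_arms)]  # per-arm reward vectors

    def select(self, x):
        """Choose the arm with the highest upper confidence bound for context x."""
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b                            # ridge-regression estimate
            scores.append(theta @ x + self.alpha * np.sqrt(x @ A_inv @ x))
        return int(np.argmax(scores))

    def update(self, arm, x, reward):
        """Online update of the chosen arm's sufficient statistics."""
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x

# Hypothetical serving loop (all names below are assumed interfaces):
# bandit = LinUCBMetaBandit(n_arms=len(recommenders), dim=CONTEXT_DIM)
# for user in event_stream():
#     x = context_features(user)                  # assumed feature extractor
#     arm = bandit.select(x)
#     items = recommenders[arm].recommend(user)   # assumed recommender API
#     reward = observe_feedback(user, items)      # e.g., click = 1, no click = 0
#     bandit.update(arm, x, reward)
```

Because each arm maintains its own statistics, the meta-bandit can shift traffic toward whichever recommender is currently performing best for the observed context, which is the behavior the abstract describes.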