Abstract:
As machine learning black boxes are increasingly being deployed in real-world applications, there has been a growing interest in developing post hoc explanations that summarize the behaviors of these black box models. However, existing algorithms for generating such explanations have been shown to lack robustness with respect to shifts in the underlying data distribution. In this paper, we propose a novel framework for generating robust explanations of black box models based on adversarial training. In particular, our framework optimizes a minimax objective that aims to construct the highest fidelity explanation with respect to the worst-case over a set of distribution shifts. We instantiate this algorithm for explanations in the form of linear models and decision sets by devising the required optimization procedures. To the best of our knowledge, this work makes the first attempt at generating post hoc explanations that are robust to a general class of distribution shifts that are of practical interest. Experimental evaluation with real-world and synthetic datasets demonstrates that our approach substantially improves the robustness of explanations without sacrificing their fidelity on the original data distribution.
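To make the minimax objective concrete, the following is a minimal illustrative sketch, not the paper's actual algorithm: it instantiates the explanation as a linear model and the shift class as mean shifts inside an L2 ball, then alternates between an adversary that ascends toward the worst-case shift and a least-squares refit of the explanation on the shifted data. The `black_box` function, the shift radius, and the step sizes are all hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical black box to be explained (stands in for any opaque model).
def black_box(X):
    return np.tanh(X @ np.array([2.0, -1.0]))

# Sample from the original data distribution.
X = rng.normal(size=(500, 2))

def fidelity_loss(w, X):
    # Squared error between the linear explanation w·x and the black box.
    return np.mean((X @ w - black_box(X)) ** 2)

# Minimax training: alternate between (1) the adversary's worst-case mean
# shift delta within an L2 ball of radius `radius`, and (2) refitting the
# linear explanation w on the shifted data.
w = np.zeros(2)
delta = np.zeros(2)
radius = 0.5   # assumed size of the allowed distribution shifts
for _ in range(50):
    # Inner max: finite-difference gradient of the loss w.r.t. delta,
    # one ascent step, then projection back onto the L2 ball.
    grad = np.zeros(2)
    eps = 1e-4
    for j in range(2):
        d = np.zeros(2)
        d[j] = eps
        grad[j] = (fidelity_loss(w, X + delta + d)
                   - fidelity_loss(w, X + delta - d)) / (2 * eps)
    delta += 0.1 * grad
    norm = np.linalg.norm(delta)
    if norm > radius:
        delta *= radius / norm
    # Outer min: least-squares fit of w on the worst-case shifted data.
    Xs = X + delta
    w, *_ = np.linalg.lstsq(Xs, black_box(Xs), rcond=None)
```

Because the outer step fits the explanation on the adversarially shifted samples rather than the original ones, the resulting `w` retains fidelity under perturbations of the data distribution; the paper's framework generalizes this idea to richer shift classes and to decision-set explanations.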