Abstract:
Prediction problems often admit competing models that perform almost equally well. This effect – called the multiplicity of good models – challenges how we build and deploy predictive models. In this paper, we study a specific notion of model multiplicity – predictive multiplicity – in which competing models assign conflicting predictions, signaling irreconcilable differences among models that perform equally well. In applications such as recidivism prediction and credit scoring, evidence of predictive multiplicity calls into question model selection and the downstream decisions that depend on the selected model. We propose measures to evaluate the severity of predictive multiplicity in classification, and we develop integer programming methods to compute these measures efficiently. We apply our methods to recidivism prediction problems. Our results show that real-world datasets may admit competing models that assign wildly conflicting predictions, and they underscore the need to measure and report predictive multiplicity in model development.