Abstract:
We develop confidence bounds that hold uniformly over time for off-policy
evaluation in the contextual bandit setting. These confidence sequences
are based on recent ideas from martingale analysis and are
non-asymptotic, non-parametric, and valid at arbitrary stopping times.
We provide algorithms for computing these confidence sequences that
strike a good balance between computational and statistical efficiency.
We empirically demonstrate the tightness of our approach in terms of
failure probability and width and apply it to the ``gated deployment'' problem of safely upgrading a production contextual bandit system.