Abstract:
Supervised learning requires the specification of a loss function to minimise.
While the theory of admissible losses is well-developed from both a computational and statistical perspective,
it offers a panoply of different choices.
In practice, this choice is typically made in an \emph{ad hoc} manner.
In hopes of making this procedure more principled,
the problem of \emph{learning the loss function} for a downstream task (e.g., classification) has garnered recent interest.
However, works in this area have been generally empirical in nature.
In this paper,
we revisit the {\sc SLIsotron} algorithm of Kakade et al. (2011) through a novel lens,
derive a generalisation based on Bregman divergences,
and show how it provides a principled procedure for learning the loss.
In detail,
we cast
{\sc SLIsotron}
as learning a loss from a family of composite square losses.
Interpreting this through the lens of \emph{proper losses} then yields a generalisation of {\sc SLIsotron} based on Bregman divergences.
The resulting {\sc BregmanTron} algorithm
jointly learns the loss along with the classifier.
The algorithm comes with a simple guarantee of convergence for the loss it learns, and its set of possible outputs comes with a guarantee of agnostic approximability of the Bayes rule.
Experiments indicate that the {\sc BregmanTron} significantly outperforms the {\sc SLIsotron}, and that the loss it learns can be minimised by other algorithms for different tasks, thereby opening up the interesting problem of \textit{loss transfer} between domains.