Abstract:
Machine learning with distributed data has recently received a significant
boost, in particular with the success of
federated learning. A common and very challenging setting is that of
vertical, or feature-partitioned, data, in which multiple data
providers hold
different features about common entities. In general, training needs
to be preceded by record linkage (RL), a step that finds the correspondence between
the observations of the datasets. RL is
prone to mistakes in the real world. Despite the importance of
the problem, there has so far been no formal assessment
of the way in which RL errors impact learning models. Existing work in the area
either uses heuristics or assumes that the optimal RL is known in
advance.
In this paper, we provide the first assessment of the problem for
supervised learning. For wide sets of losses, we provide technical conditions
under which the
classifier learned after noisy RL converges, as the data size grows, to the best classifier that would be learned from
mistake-free RL. This yields new insights into how the
RL + ML pipeline operates, from the role of large-margin
classification in dampening the impact of RL mistakes to clues on how to further
optimize RL as a preprocessing step to ML. Experiments on a large
UCI benchmark validate
these formal observations.