Abstract:
A recent series of theoretical works showed that the dynamics of neural
networks with a certain initialisation are well-captured by kernel methods.
Concurrent empirical work demonstrated that kernel methods can come close to
the performance of neural networks on some image classification tasks.
These results raise the question of whether neural networks learn successfully
only when kernels do, despite neural networks being the more expressive
function class.
Here, we show that two-layer neural networks with *only a few neurons* achieve
near-optimal performance on high-dimensional Gaussian mixture classification
while lazy training approaches such as random features and kernel methods do
not.
Our analysis is based on the derivation of a set of ordinary differential
equations that exactly track the dynamics of the network and thus allow us to
extract its asymptotic performance as a function of regularisation or
signal-to-noise ratio.
We also show that over-parametrising the neural network leads to faster convergence but does not improve its final performance.