Abstract:
Feature selection problems have been extensively studied in the setting of linear estimation, for instance LASSO, but less emphasis has been placed on feature selection for neural networks.
In this study, we propose a method for feature selection in non-linear function estimation problems. The new procedure is based on directly penalizing the $\ell_0$ norm of features, or the count of the number of selected features. Our $\ell_0$ based regularization relies on a continuous relaxation of the Bernoulli distribution, which allows our model to learn the parameters of the approximate Bernoulli distributions via gradient descent. The proposed framework simultaneously learns a non-linear regression or classification function while selecting a small subset of features. We provide an information-theoretic justification for incorporating Bernoulli distribution for feature selection. Furthermore, we evaluate our method using synthetic and real-life data and demonstrate that our approach outperforms other commonly used methods in terms of predictive performance and feature selection.