A machine learning model maps input variables (features) to an output prediction. The output can be binary, such as whether a car is defective or not; it can also be numerical, such as next year's inflation rate. A good machine learning model uses the minimum number of features needed to achieve the most accurate prediction. In many cases, a smaller set of features gives better performance than a larger one, because additional features add noise and lead to overfitting, so the model does not generalize well to new test cases.
There is another reason for feature selection: computational speed. A model with a large number of features takes a long time to build and may be too slow to apply in real time.
The third reason is the ratio of sample size to the number of features. If the number of features is large relative to the number of samples, the model will overfit and cannot make good predictions. In gene expression studies, for example, there can be more than 100,000 features but only a few hundred data points. To increase the precision of the model, we need to increase the sample size; for supervised learning, this means more human labeling of additional data, which is sometimes not possible.
How do we go about selecting the smallest set of features that gives the best performance? If we try every subset of n features, there are 2^n subsets to evaluate, so the computation grows exponentially with n. This is clearly infeasible.
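To make the combinatorial explosion concrete, here is a minimal sketch (the feature counts are illustrative, not from the source) that counts the non-empty feature subsets an exhaustive search would have to evaluate:

```python
# With n features there are 2**n - 1 non-empty subsets to try.
def num_subsets(n):
    return 2 ** n - 1

for n in (10, 20, 30):
    print(f"{n} features -> {num_subsets(n):,} candidate subsets")
```

Even at 30 features, an exhaustive search already faces over a billion candidate subsets.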
Feature selection has been studied in the machine learning field for the last two decades. The following four methods have been proposed:
- Forward search: This method adds features one by one. It is a greedy algorithm that does not always find the best solution.
- Backward search: This method starts with all features and removes them one by one until performance no longer improves. It is also a greedy algorithm.
- Adaptive Lasso
- L1-regularized Logistic Regression
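Forward search is the easiest of these to sketch. The following is a minimal, self-contained illustration in plain Python; the scoring function `toy_score` and the feature names are hypothetical stand-ins for a real model-evaluation metric (e.g., cross-validated accuracy):

```python
# Greedy forward search: at each step, add the single feature that most
# improves the score; stop when no remaining feature improves it.
def forward_search(features, score):
    """features: list of feature names; score: callable on a feature subset."""
    selected = []
    best = score(selected)
    improved = True
    while improved and len(selected) < len(features):
        improved = False
        candidate, cand_score = None, best
        for f in features:
            if f in selected:
                continue
            s = score(selected + [f])
            if s > cand_score:
                candidate, cand_score = f, s
                improved = True
        if improved:
            selected.append(candidate)
            best = cand_score
    return selected

# Hypothetical scoring function: f1 and f3 are the "truly useful" features,
# and every extra feature carries a small complexity penalty.
useful = {"f1": 0.6, "f3": 0.3}
def toy_score(subset):
    return sum(useful.get(f, 0.0) for f in subset) - 0.05 * len(subset)

print(forward_search(["f1", "f2", "f3", "f4"], toy_score))  # -> ['f1', 'f3']
```

Backward search follows the same pattern in reverse: start from the full feature list and greedily drop the feature whose removal most improves (or least hurts) the score. Neither greedy direction is guaranteed to find the globally best subset.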