Dec 12, 2012

Feature selection – An essential step of machine learning

A machine learning model maps input variables (features) to a predicted output. The output can be binary, such as whether a car is defective or not; it can also be numerical, such as next year's inflation rate. A good machine learning model uses the minimum number of features needed to get the most accurate prediction. In many cases, a smaller number of features gives better performance than a larger number. This is because additional features add noise and lead to overfitting, so the model does not do well on new test cases.

There is another reason for feature selection: computational speed. A model with a large number of features takes a long time to build, and may take too long to apply in real time.

The third reason is sample size vs. the number of features. If the number of features is large relative to the number of data points, we will overfit the model and cannot make good predictions. In a gene expression study, for example, there can be more than 100,000 features but only a few hundred data points. In order to increase the precision of the model, we need to increase the sample size. For supervised learning, this means more human labeling of additional data. Sometimes this is not possible.

How do we go about selecting the best and smallest set of features? If we try all subsets of n features, we have to evaluate on the order of 2^n candidate models. This is clearly infeasible for all but small n.
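
To make that cost concrete, here is a small sketch that enumerates every non-empty subset of a feature list and shows how quickly the count grows. The feature names are made up for illustration; nothing here trains a model.

```python
from itertools import combinations

def all_feature_subsets(features):
    """Yield every non-empty subset of the given feature names."""
    for size in range(1, len(features) + 1):
        yield from combinations(features, size)

# Hypothetical feature names, used only to illustrate the counting.
features = ["age", "income", "zip_code", "tenure"]
subsets = list(all_feature_subsets(features))
print(len(subsets))  # 2**4 - 1 = 15 non-empty subsets

# The number of subsets doubles with every extra feature.
for n in (10, 20, 30, 40):
    print(n, 2 ** n)
```

With only 40 features there are already about a trillion subsets, and each one would need its own model fit.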

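Feature selection has been studied in the machine learning field for the last two decades. Among others, the following four methods have been proposed:
  1. Forward search: This method starts with no features and adds them one by one, until the performance no longer improves. It's a greedy algorithm that does not always lead to the best solution.
  2. Backward search: This method starts with all features and removes them one by one, until the performance no longer improves. It is also a greedy algorithm.
  3. Adaptive Lasso: a variant of the Lasso that applies a weighted L1 penalty to the coefficients.
  4. L1-regularized Logistic Regression: the L1 penalty drives some coefficients to exactly zero, which removes those features from the model.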
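
As a rough sketch of forward search, the function below takes a list of features and a caller-supplied `score(subset)` function (higher is better, for example cross-validated accuracy). The `toy_score` function and its numbers are invented stand-ins so the example runs on its own; in practice the score would come from fitting and validating a real model.

```python
def forward_search(features, score):
    """Greedy forward selection.

    Adds, one at a time, the feature that most improves score(subset),
    and stops when no single addition improves the score.
    """
    selected = []
    best_score = float("-inf")
    remaining = list(features)
    while remaining:
        # Score each remaining feature when added to the current selection.
        candidates = [(score(selected + [f]), f) for f in remaining]
        top_score, top_feature = max(candidates, key=lambda pair: pair[0])
        if top_score <= best_score:
            break  # no candidate helps, so stop
        selected.append(top_feature)
        remaining.remove(top_feature)
        best_score = top_score
    return selected, best_score

# Toy stand-in for a validation score (made-up numbers): each feature
# contributes a fixed gain, and every feature used costs 0.05.
GAIN = {"a": 0.30, "b": 0.20, "c": 0.02, "d": 0.01}

def toy_score(subset):
    return sum(GAIN[f] for f in subset) - 0.05 * len(subset)

selected, best = forward_search(list(GAIN), toy_score)
print(selected)  # ['a', 'b']
```

Backward search is the mirror image: start from the full list and repeatedly drop the feature whose removal improves the score the most. Either way, the search needs at most about n^2 score evaluations rather than 2^n, at the price of possibly missing the best subset.
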
I favor the last method, L1-regularized logistic regression (LR): in our experience, it has consistently given the best performance among these feature selection methods. Hopefully, in future posts, I can expand on this discussion a little more.
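
To illustrate how an L1 penalty performs feature selection, here is a minimal, dependency-free sketch of L1-regularized logistic regression fitted by proximal gradient descent (a gradient step followed by soft-thresholding). The synthetic data, the penalty strength `lam`, the step size `lr`, and the iteration count are all assumptions chosen for this example, not tuned recommendations; a real project would more likely use an existing library implementation.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def soft_threshold(v, t):
    # Shrink v toward zero by t; values within [-t, t] become exactly 0.
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0

def fit_l1_logistic(X, y, lam=0.15, lr=0.5, steps=200):
    """Minimize mean log-loss + lam * sum(|w_j|) by proximal gradient descent.

    The intercept b is not penalized.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            grad_b += err
            for j in range(d):
                grad_w[j] += err * xi[j]
        b -= lr * grad_b / n
        # Gradient step on the log-loss, then the L1 proximal step.
        w = [soft_threshold(wj - lr * gj / n, lr * lam)
             for wj, gj in zip(w, grad_w)]
    return w, b

# Synthetic data (an assumption for illustration): only x0 and x1 drive
# the label; x2..x5 are pure noise.
random.seed(0)
names = ["x0", "x1", "x2", "x3", "x4", "x5"]
X = [[random.gauss(0, 1) for _ in names] for _ in range(300)]
y = [1 if row[0] - row[1] + 0.5 * random.gauss(0, 1) > 0 else 0 for row in X]

w, b = fit_l1_logistic(X, y)
selected = [name for name, wj in zip(names, w) if wj != 0.0]
print(selected)
```

The features kept are those whose weights are not exactly zero. A larger `lam` removes more features, so in practice it is usually chosen by cross-validation.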
