Jan 20, 2014

Statistics and Data Mining

Probability theory is at the core of data mining. In the early days of data mining, people counted frequent patterns such as "shoppers who buy diapers also buy beer." This is essentially estimating a conditional probability: the probability of buying beer given buying diapers. In mathematical terms, it is P(buy beer | buy diaper).

Here is an example: suppose 5% of shoppers buy diapers, and only 2% of shoppers buy both diapers and beer. What is the likelihood that someone buys beer given that they buy diapers? The answer is a conditional probability: P(buy beer | buy diaper) = P(buy both) / P(buy diaper) = 2% / 5% = 40%.
The so-called “association rule” in data mining can be understood much more clearly through these statistical terms.
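
To make this concrete, here is a minimal sketch that computes the support and confidence of the rule "diaper → beer" from a list of baskets. The basket data is hypothetical: it is constructed only so that 5% of shoppers buy diapers and 2% buy both, and the beer-only and milk baskets are filler. The function names are also my own for illustration.

```python
# Hypothetical baskets chosen to match the example: 1000 shoppers,
# 50 buy diapers (5%), and 20 of those also buy beer (2% of all shoppers).
baskets = (
    [{"diaper", "beer"}] * 20
    + [{"diaper"}] * 30
    + [{"beer"}] * 100   # filler, not from the example
    + [{"milk"}] * 850   # filler, not from the example
)

def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    hits = sum(1 for b in baskets if itemset <= b)
    return hits / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Estimate P(consequent | antecedent) from basket counts."""
    both = sum(1 for b in baskets if (antecedent | consequent) <= b)
    ante = sum(1 for b in baskets if antecedent <= b)
    return both / ante

p_diaper = support({"diaper"}, baskets)           # 50 / 1000
p_both = support({"diaper", "beer"}, baskets)     # 20 / 1000
conf = confidence({"diaper"}, {"beer"}, baskets)  # 20 / 50
print(p_diaper, p_both, conf)
```

The "confidence" of an association rule is exactly the conditional probability estimated from counts, which is why the statistical view makes the rule easier to reason about.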

In machine learning, the core concept is predicting whether a data sample belongs to a certain population (class). For example, we want to predict whether a person is a potential buyer, or whether a credit card transaction is fraudulent. Such a prediction is usually associated with a confidence score between 0 and 1. When the confidence score is 1, we are 100% sure about the prediction. Generating such a confidence score requires statistical inference.
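
As one illustration of where such a score can come from, the sketch below uses the logistic (sigmoid) function, the same function behind logistic regression, to squash a weighted sum of features into a value between 0 and 1. The features (`amount`, `is_foreign`) and the weights are invented for this example; in practice the weights would be estimated from labeled data.

```python
import math

def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def fraud_score(amount, is_foreign, weights=(-4.0, 0.002, 1.5)):
    """Toy confidence score that a transaction is fraudulent.

    The weights are hypothetical placeholders, not fitted values.
    """
    bias, w_amount, w_foreign = weights
    z = bias + w_amount * amount + w_foreign * is_foreign
    return sigmoid(z)

small_domestic = fraud_score(amount=20, is_foreign=0)
large_foreign = fraud_score(amount=2000, is_foreign=1)
print(small_domestic, large_foreign)
```

With these made-up weights, the small domestic purchase gets a score near 0 and the large foreign one a score well above 0.5, but neither ever reaches exactly 0 or 1.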

Statistics is essential when evaluating a data mining model. We use a control group or conduct an A/B test. For example, if we test with 1000 people in the treatment group (Group A) vs. 1000 people in the control group (Group B) and find that Group A performs better, is this result conclusive? In other words, is it statistically significant? That has to be answered with statistical knowledge.
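
One common way to answer that question is a two-proportion z-test. The sketch below assumes "performance" means a conversion rate, and the conversion counts (120 vs. 100 out of 1000) are hypothetical numbers picked for illustration. It uses only the Python standard library.

```python
import math

def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Two-sided z-test for the difference between two proportions."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    # Pooled rate under the null hypothesis that both groups are the same.
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical outcome: 12% conversion in Group A vs. 10% in Group B.
z, p_value = two_proportion_z_test(120, 1000, 100, 1000)
print(f"z = {z:.2f}, p-value = {p_value:.3f}")
```

With these assumed counts the p-value comes out at about 0.15, so a two-point lift with 1000 people per group would not be significant at the usual 0.05 level, even though Group A looks better.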

In the data collection phase, statistics helps us decide the size of the training data (sample) and whether it is representative. Not to mention that some currently popular machine learning methods, such as logistic regression, were originally developed in the statistics community.
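
As a rough sketch of the sample-size question, the snippet below uses the textbook formula for estimating a proportion within a given margin of error, n = z² · p(1 − p) / e². This is only one simple case, assuming a simple random sample; it says nothing about whether the sample is representative. The function name and defaults are my own.

```python
import math
from statistics import NormalDist

def sample_size_for_proportion(margin, confidence=0.95, p=0.5):
    """Samples needed to estimate a proportion within +/- `margin`.

    p=0.5 is the most conservative guess (it maximizes p * (1 - p)).
    """
    # z-value for a two-sided interval at the given confidence level.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil(z * z * p * (1 - p) / margin ** 2)

print(sample_size_for_proportion(0.05))  # margin of 5 percentage points
print(sample_size_for_proportion(0.03))  # margin of 3 percentage points
```

Because the margin is squared in the denominator, halving the margin of error roughly quadruples the number of samples needed.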

Just as mathematics provides the foundation for physics, statistics has now become a foundation for machine learning. Its importance in data mining will become more prominent over time.
