Probability theory is the core of data mining. In the early
days of data mining, people counted frequent patterns, such as shoppers who buy
diapers also buying beer. This is essentially deriving a conditional probability:
the probability of buying beer given buying diapers. In mathematical terms, it is P(buy beer | buy diaper).

Here is an example: suppose 5% of shoppers buy diapers, and
only 2% of shoppers buy both diapers and beer. What is the likelihood that someone
buys beer after buying diapers? The answer is
a conditional probability: P(buy beer | buy diaper) = 0.02 / 0.05 = 40%.
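The arithmetic above can be sketched in a few lines of Python, using the definition P(A | B) = P(A and B) / P(B):

```python
# Conditional probability from the diaper/beer example.
p_diaper = 0.05            # 5% of shoppers buy diapers
p_diaper_and_beer = 0.02   # 2% of shoppers buy both

# P(beer | diaper) = P(beer and diaper) / P(diaper)
p_beer_given_diaper = p_diaper_and_beer / p_diaper
print(p_beer_given_diaper)  # 0.4, i.e. 40%
```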

The so-called “association rule” in data mining can be
understood much more clearly through these statistical terms.

In machine learning, the core concept is predicting whether
a data sample belongs to a certain population (class). For example, we may want to
predict whether a person is a potential buyer, or whether a credit card transaction is
fraudulent. Such a prediction is typically associated with a confidence score
between 0 and 1. When the confidence score is 1, we are 100% sure about the
prediction. Generating such a confidence score requires statistical inference.
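One common way such a score is produced (the text does not name a specific model, so this is an illustrative sketch) is the logistic function, which squashes a raw model score into the (0, 1) range:

```python
import math

def confidence(score: float) -> float:
    """Map a raw model score to a confidence in (0, 1) via the logistic function."""
    return 1.0 / (1.0 + math.exp(-score))

# A score of 0 means maximal uncertainty; large positive scores approach 1.
print(confidence(0.0))   # 0.5
print(confidence(4.0))   # roughly 0.98
```

This is the same function at the heart of logistic regression, mentioned later as a method that originated in statistics.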

Statistics is essential when evaluating a data mining model.
We use a control group or conduct an A/B test. For example, if we test with 1,000
people in a treatment group (Group A) vs. 1,000 people in a control group (Group B)
and find that Group A performs better, is this result conclusive? In other
words, is it *statistically significant*? That question has to be answered with
statistical knowledge.
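A standard way to answer that question for two groups of this kind is a two-proportion z-test. The sketch below uses hypothetical conversion counts (120 vs. 90 out of 1,000 each, not figures from the text) and only the Python standard library:

```python
from statistics import NormalDist

def two_proportion_z_test(success_a: int, n_a: int,
                          success_b: int, n_b: int):
    """Two-sided z-test for the difference between two proportions."""
    p_a, p_b = success_a / n_a, success_b / n_b
    p_pool = (success_a + success_b) / (n_a + n_b)       # pooled proportion
    se = (p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (p_a - p_b) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))          # two-sided p-value
    return z, p_value

# Hypothetical A/B result: 120/1000 conversions in Group A vs. 90/1000 in Group B.
z, p = two_proportion_z_test(120, 1000, 90, 1000)
print(round(z, 3), round(p, 4))
```

If the p-value falls below a chosen threshold (commonly 0.05), the difference between the groups is declared statistically significant.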
In the data collection phase, statistics helps us decide the
size of the training data (the sample) and whether it is representative. Not to
mention, some currently popular machine learning methods, such as logistic
regression, were originally developed in the statistics community.
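As a concrete illustration of the sample-size question, the classic formula for estimating a proportion within a given margin of error can be written as follows (the function name and defaults are my own; the text does not prescribe a formula):

```python
import math
from statistics import NormalDist

def sample_size_for_proportion(margin_of_error: float,
                               confidence_level: float = 0.95,
                               p: float = 0.5) -> int:
    """Minimum sample size to estimate a proportion within +/- margin_of_error."""
    # z-score matching the desired two-sided confidence level
    z = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    return math.ceil(z ** 2 * p * (1 - p) / margin_of_error ** 2)

# Worst case p = 0.5, +/-3% margin at 95% confidence: roughly 1,068 samples.
print(sample_size_for_proportion(0.03))
```

Using p = 0.5 gives the most conservative (largest) sample size when the true proportion is unknown.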