Probability theory is the core of data mining. In the early days of data mining, people counted frequent patterns, such as shoppers who buy diapers also tending to buy beer. This is essentially deriving the conditional probability of buying beer given buying diapers. In mathematical terms, it is P(buy beer | buy diaper).
Here is an example: suppose 5% of shoppers buy diapers, and only 2% of shoppers buy both diapers and beer. What is the likelihood that someone who buys diapers also buys beer? The answer is a conditional probability: P(buy beer | buy diaper) = 0.02 / 0.05 = 40%.
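The worked example above can be sketched in a few lines of Python. The percentages come from the example; the variable names are just for illustration.

```python
# Conditional probability from the diaper/beer example:
# P(buy beer | buy diaper) = P(beer and diaper) / P(diaper)
p_diaper = 0.05   # 5% of shoppers buy diapers
p_both = 0.02     # 2% of shoppers buy both diapers and beer

p_beer_given_diaper = p_both / p_diaper
print(p_beer_given_diaper)  # about 0.4, i.e. a 40% chance
```

This is the same quantity an association-rule miner would report as the "confidence" of the rule diaper → beer.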
The so-called "association rule" in data mining can be understood much more clearly through these statistical terms.
In machine learning, the core concept is predicting whether a data sample belongs to a certain population (class). For example, we may want to predict whether a person is a potential buyer, or whether a credit card transaction is fraudulent. Such a prediction is always associated with a confidence score between 0 and 1. When the confidence score is 1, we are 100% sure about the prediction. Generating such a confidence score requires statistical inference.
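One common way such a confidence score is produced is by passing a linear score through the logistic (sigmoid) function, which squashes any real number into (0, 1). Below is a minimal sketch; the weights and feature values are invented purely for illustration, not taken from any real model.

```python
import math

def sigmoid(z):
    """Map a raw score to a confidence value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical linear model scoring one credit card transaction.
# Weights, bias, and feature values are illustrative placeholders.
weights = [0.8, -1.2, 0.5]
features = [1.0, 0.3, 2.0]   # e.g. scaled amount, hour of day, merchant risk
bias = -0.4

score = sum(w * x for w, x in zip(weights, features)) + bias
confidence = sigmoid(score)  # interpreted as P(transaction is fraudulent)
```

A score of 0 maps to a confidence of exactly 0.5; large positive scores approach 1 and large negative scores approach 0.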
Statistics is essential when evaluating a data mining model. We use a control group or conduct an A/B test. For example, if we test with 1,000 people in the treatment group (Group A) vs. 1,000 people in the control group (Group B) and find that Group A performs better, is this result conclusive? In other words, is it statistically significant? That has to be answered with statistical knowledge.
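One standard way to answer that question is a two-proportion z-test. The sketch below uses the 1,000-vs-1,000 setup from the example; the conversion counts (120 and 90) are invented for illustration.

```python
import math

def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Two-sided z-test for the difference between two proportions."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    # Pooled proportion under the null hypothesis (no difference)
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Suppose 120/1000 convert in treatment (A) vs. 90/1000 in control (B)
z, p = two_proportion_z_test(120, 1000, 90, 1000)
print(f"z = {z:.2f}, p = {p:.4f}")  # conventionally significant if p < 0.05
```

With these made-up counts the p-value comes out below 0.05, so the difference would be declared statistically significant at the conventional level; with smaller gaps or smaller samples, the same observed ranking could easily be noise.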
In the data collection phase, statistics helps us decide the size of the training data (the sample) and whether it is representative. Not to mention that some currently popular machine learning methods, such as logistic regression, were originally developed in the statistics community.
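As a concrete instance of deciding sample size, the classic formula for estimating a proportion with a given margin of error is n = z² · p(1 − p) / e². The sketch below uses the common worst-case assumption p = 0.5 and a 95% confidence level; the ±5% margin of error is an illustrative choice.

```python
import math

def sample_size(margin_of_error, p=0.5, z=1.96):
    """Minimum sample size to estimate a proportion within a margin of error.

    p = 0.5 is the conservative worst case (maximizes p * (1 - p));
    z = 1.96 corresponds to 95% confidence.
    """
    return math.ceil(z**2 * p * (1 - p) / margin_of_error**2)

n = sample_size(0.05)  # ±5% margin at 95% confidence
print(n)  # 385
```

Note this formula only addresses precision; whether the sample is representative (free of selection bias) is a separate question that no sample-size formula can fix.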