Jan 20, 2014

Statistics and Data Mining

Probability theory is at the core of data mining. In the early days of data mining, people counted frequent patterns such as “shoppers who buy diapers also buy beer.” This is essentially estimating a conditional probability: the probability of buying beer given buying diapers. In mathematical terms, it is P(buy beer | buy diapers).

Here is an example: suppose 5% of shoppers buy diapers, and only 2% of shoppers buy both diapers and beer. What is the likelihood that someone who buys diapers also buys beer? The answer is a conditional probability: P(buy beer | buy diapers) = P(buy both) / P(buy diapers) = 2% / 5% = 40%.
The so-called “association rule” in data mining can be understood much more clearly through these statistical terms.
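
In association-rule vocabulary, the share of shoppers who buy an item set is its support, and the conditional probability of the rule is its confidence. Here is a minimal Python sketch of that calculation using the two percentages from the example. The function and variable names are my own, chosen for illustration. Note that the two figures only determine P(buy beer | buy diapers); the reverse direction would also need the overall share of beer buyers, which the example does not give.

```python
def rule_confidence(support_both, support_antecedent):
    """Confidence of the rule antecedent -> consequent.

    This is the conditional probability
    P(consequent | antecedent) = P(both) / P(antecedent).
    """
    if support_antecedent <= 0:
        raise ValueError("the antecedent must occur at least sometimes")
    return support_both / support_antecedent


# Figures from the example in the text.
p_diapers = 0.05  # share of shoppers who buy diapers
p_both = 0.02     # share of shoppers who buy both diapers and beer

p_beer_given_diapers = rule_confidence(p_both, p_diapers)
print(f"P(buy beer | buy diapers) = {p_beer_given_diapers:.0%}")
```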

In machine learning, the core concept is predicting whether a data sample belongs to a certain population (class). For example, we want to predict whether a person is a potential buyer, or whether a credit card transaction is fraudulent. Such a prediction is always associated with a confidence score between 0 and 1. When the confidence score is 1, we are 100% sure about the prediction. Generating such a confidence score requires statistical inference.
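
One common way to produce a score between 0 and 1 is the logistic function, which squashes any real-valued score into that range. The sketch below is only an illustration: the features (transaction amount, foreign flag) and the weights are hypothetical values I made up, not a fitted model. In practice the weights would be estimated from labeled data.

```python
from math import exp


def logistic(score):
    """Map any real-valued score to a value between 0 and 1."""
    if score >= 0:
        return 1.0 / (1.0 + exp(-score))
    # Equivalent form that avoids overflow for large negative scores.
    e = exp(score)
    return e / (1.0 + e)


def fraud_confidence(amount_usd, is_foreign):
    """Confidence that a transaction is fraudulent.

    The intercept and weights are hypothetical, for illustration only.
    """
    score = -4.0 + 0.002 * amount_usd + 1.5 * is_foreign
    return logistic(score)


small_domestic = fraud_confidence(amount_usd=50, is_foreign=0)
large_foreign = fraud_confidence(amount_usd=2000, is_foreign=1)
print(f"small domestic purchase: {small_domestic:.3f}")
print(f"large foreign purchase:  {large_foreign:.3f}")
```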

Statistics is essential when evaluating a data mining model. We use a control group or conduct an A/B test. For example, if we test with 1,000 people in the treatment group (Group A) vs. 1,000 people in the control group (Group B) and find that Group A performs better, is this result conclusive? In other words, is it statistically significant? That has to be answered with statistical knowledge.
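
As a rough sketch of how such a question can be answered, here is a two-proportion z-test in Python. The group sizes of 1,000 come from the example above, but the conversion counts (120 in Group A, 100 in Group B) are hypothetical numbers I picked for illustration, and "performance" is assumed to mean a conversion rate. Other metrics would call for other tests.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Two-sided z-test for the difference between two proportions.

    Returns the z statistic and the p-value.
    """
    p_a = success_a / n_a
    p_b = success_b / n_b
    # Pooled rate under the null hypothesis that both groups are the same.
    pooled = (success_a + success_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        raise ValueError("no variation in outcomes; the test is undefined")
    z = (p_a - p_b) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical counts: 120 of 1,000 convert in Group A, 100 of 1,000 in Group B.
z, p_value = two_proportion_z_test(120, 1000, 100, 1000)
print(f"z = {z:.2f}, p-value = {p_value:.3f}")
```

With these made-up counts the p-value comes out at roughly 0.15, so a 12% vs. 10% difference on 1,000 people per group would not be significant at the usual 0.05 level, even though Group A looks better.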

In the data collection phase, statistics helps us decide the size of the training data (sample) and whether it is representative. Not to mention, some currently popular machine learning methods, such as logistic regression, were originally developed in the statistics community.
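
To give one concrete flavor of a sample-size calculation, the sketch below uses the standard formula for estimating a proportion within a given margin of error, n = z² · p(1 − p) / e². This is just one simple case, assuming a simple random sample; sizing a training set for a real model depends on much more. The function name and defaults are my own.

```python
from math import ceil
from statistics import NormalDist


def sample_size_for_proportion(margin_of_error, confidence_level=0.95,
                               expected_rate=0.5):
    """Smallest sample size that estimates a proportion within the margin.

    expected_rate=0.5 is the most conservative choice (largest sample).
    """
    # z value for a two-sided interval at the given confidence level.
    z = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    n = z ** 2 * expected_rate * (1 - expected_rate) / margin_of_error ** 2
    return ceil(n)


n_3_points = sample_size_for_proportion(0.03)
n_5_points = sample_size_for_proportion(0.05)
print(f"within 3 points at 95% confidence: {n_3_points} samples")
print(f"within 5 points at 95% confidence: {n_5_points} samples")
```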

As mathematics provides the foundation for physics, statistics has now become a foundation for machine learning. Its importance in data mining will become more prominent over time.

Jan 3, 2014

Data Mining Conferences in 2014

A new year has come and exciting data mining conferences are lining up on the horizon. Here is a list of major conferences in this field. They are good places to socialize with other data science researchers and practitioners, and to connect with potential candidates or future employers.

Last year, I missed KDD 2013. This year, that conference is at the top of my list.


  • SDM (SIAM Conference on Data Mining), April 24-26, Philadelphia
  • KDD (Knowledge Discovery and Data Mining), Aug 24-27, New York City

  • Strata (O'Reilly conference), Feb 11-13, Santa Clara, California
  • Predictive Analytics World, March 16-21, San Francisco, California

  • ICML (machine learning), June 21-26, Beijing, China
  • AAAI (artificial intelligence), July 27-31, Quebec City, Canada
  • ICDM (International Conference on Data Mining), Dec 14-17, Shenzhen, China

Specialized areas
  • WWW (web data, text mining), April 7-11, Seoul, Korea
  • ACL (natural language processing), June 23-25, Baltimore
  • SIGIR (text mining), July 6-11, Gold Coast, Australia
  • Interspeech (speech mining), Sept 14-18, Singapore
  • RecSys (recommender systems), Oct 6-10, Foster City

Jan 2, 2014

Hiring data scientists

I am often asked by friends and colleagues who data scientists are. Many are interested in the answer to a very practical question: “Who should I hire as a data scientist?”

In my practical experience building data science teams, I have come to appreciate the following qualities:
  1. A fundamental understanding of machine learning. Ultimately, data mining cannot exist without machine learning, which provides its core techniques. Thus a researcher in machine learning or a related field (such as natural language processing, computer vision, artificial intelligence, or bioinformatics) is an ideal candidate. Such researchers have studied different machine learning methods and know the newest and best techniques to apply to a problem.
  2. A sophisticated understanding of statistics and advanced mathematics. Such understanding requires years of training. Thus a Ph.D. degree is typically required for data scientists.
  3. Training in computer science. Ultimately, mining data is a way of computing. It requires designing computer algorithms that are efficient in memory (space) and time. People who are trained in computer science understand the tradeoff between space and time in computing. They understand the basic concepts of computational complexity. Someone who has majored in computer science would have this training ingrained in their DNA.
  4. Good coding skills. We live in a big data era. In order to work with data, we write code to process it, clean it, and transform it. Then we need to create programs on a big data platform, and test and improve them constantly. All of this requires good coding skills. Data mining is about implementation and testing. Programming skill is thus a core requirement.
In hiring a data scientist, a few other qualifications are desirable but not required:
  1. Experience with big data. This enables someone to work in certain environments, such as Hadoop, and to use their tools quickly. But such knowledge can be easily learned.
  2. Knowledge of a specific programming language. A good programmer can learn a new language quickly. In addition, there are many options for running big data programs, from Python to Java to Scala. Someone who has mastered any one of these languages can be very productive.
A good data scientist who satisfies the four basic skill requirements is hard to find today. Even though our universities train tens of thousands of them each year, market demand is far higher. Many people have read the report by McKinsey, which states that there will be a gap of 140,000 jobs (demand exceeding talent supply) for data scientists by 2018.

Even today, in early 2014, companies are struggling to bring in data scientists. Those who are on the job market are immediately snatched up by large, well-known companies. Today, every company is trying to implement a “data strategy” (or a “big data strategy,” in its fancier term). This is a golden age for data scientists but a challenging time for employers.