Nov 11, 2013

Overview of data mining

When people talk about data mining, sometimes they mean the methods used in the field, such as machine learning. Sometimes they mean mining a specific kind of data, as in text mining or video mining. Often, people use the term "big data" to refer simply to the infrastructure behind data mining, such as Hadoop or Cassandra.

Given all of these different usages, newcomers to the data mining field can feel lost in a wonderland.
Here we give an overview that connects these components. Essentially, we can think of the data mining field as consisting of four major layers. The top layer is the basic methodology, such as machine learning or frequent pattern mining. The second layer is the application of that methodology to various data types. For social networks, this is graph mining. For sensor and mobile data, this is stream data mining. The third layer is the infrastructure, where Hadoop, NoSQL and other environments were invented to support large-scale data movement and retrieval. Finally, the fourth layer is concerned with building a data mining team and understanding the profiles of the people who support all of the above operations.

Given this four-layer separation, we can easily see where various discussions of data mining fall in the picture. For example, the hot topic of "deep learning" belongs to the methodology layer, under machine learning. More specifically, it is a family of neural network methods that applies to both supervised and unsupervised learning. Another topic, "Natural Language Processing" (NLP), is part of mining text data, so it belongs to the second layer. Not surprisingly, this field uses machine learning extensively as its core methodology.
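
To make the layering concrete, here is a minimal Python sketch that stores the four layers with the examples mentioned in this post and looks up where a topic falls. The names `LAYERS` and `layer_of` and the short layer labels are made up for illustration, and the topic lists are only the examples named above, not a complete catalog.

```python
# The four layers described in this post, top to bottom.
# Layer labels and topic lists are illustrative, taken from the examples in the text.
LAYERS = {
    1: ("methodology",
        ["machine learning", "frequent pattern mining", "deep learning"]),
    2: ("application to data types",
        ["text mining", "video mining", "graph mining",
         "stream data mining", "natural language processing"]),
    3: ("infrastructure",
        ["hadoop", "cassandra", "nosql"]),
    4: ("team and people",
        ["data mining team"]),
}


def layer_of(topic):
    """Return (layer number, layer label) for a known topic, or None if unknown."""
    key = topic.strip().lower()
    for number, (label, topics) in LAYERS.items():
        if key in topics:
            return number, label
    return None


if __name__ == "__main__":
    for topic in ["Deep Learning", "Natural Language Processing", "Hadoop"]:
        print(topic, "->", layer_of(topic))
```

A topic that is not listed simply returns None, which is a reminder that the table only covers the handful of examples discussed here.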