Nov 11, 2013

Overview of data mining

When people talk about data mining, they sometimes refer to the methods used in this field, such as machine learning. Sometimes they refer to a specific type of data of interest, such as text mining or video mining. Many other times, people use the term "big data" to refer simply to the infrastructure of data mining, such as Hadoop or Cassandra.

Given all of these varied references, newcomers to the data mining field can feel lost in a wonderland.
Here we give an overview that connects all these components together. Essentially, we can think of the data mining field as consisting of four major layers. At the top layer is the basic methodology, such as machine learning or frequent pattern mining. The second layer is the application to various data types: for social networks, this is graph mining; for sensor and mobile data, this is stream data mining. The third layer is the infrastructure, where Hadoop, NoSQL and other environments are invented to support large-scale data movement and retrieval. Finally, at the fourth layer, we are concerned with building a data mining team and understanding the people profiles that support all of the above operations.

Given this four-layer separation, we can easily see where various discussions on data mining fall in the picture. For example, the hot topic of "deep learning" belongs to the machine learning layer; more specifically, it is closely associated with unsupervised learning. Another topic, natural language processing (NLP), is part of mining text data. Not surprisingly, this field uses machine learning extensively as its core methodology.

Sep 12, 2013

How casinos are betting on big data

As reported by Yahoo! Finance today, casinos like Caesars crunch big data to find ways to attract gamblers. They can find out if a gambler is losing or winning too much:
"They could win a lot or they lose a lot or they could have something in the middle. … So we do try to make sure that people don't have really unfortunate visits," said Caesars CEO Gary Loveman. 

Caesars has been a leader in data mining in the gambling industry. It has more than 200 data experts (including some real data scientists?) in house to crunch data on its loyalty program, VIP member patterns, and so on. That work was reported at the KDD 2011 Industry Practice Expo.

Next time you visit a casino, expect a suddenly friendlier slot machine when you are on a losing streak ...

The complete Yahoo! Finance report is here

Sep 4, 2013

Machine learning as a service

With data widely available and machine learning methods maturing, commercial application of machine learning has become a reality. A new type of business has sprung up: providing machine learning services. It turns out this service fills a big market void. About a dozen companies have grown rapidly in this space in the last few years, and many more startups have been formed this year.

Let's take a look at a few successful companies in this space:

1. Mu Sigma
The company derives its name from “mu” and “sigma” in statistics. Based in Chicago but with significant operations in India, Mu Sigma provides analytics for marketing, supply chain and risk management. The company was founded in 2004 and has grown to 2,500 employees. Today, the company is valued at $1 billion.
2. Opera Solutions
Opera Solutions was founded in 2004, headquartered in New York but with significant operations in San Diego. The company gathered a group of strong data scientists to work on projects ranging from financial services, health care and insurance to manufacturing. The company's scientists have participated in important data mining contests such as the KDD Cup and the Netflix contest, demonstrating strong technical and research skills.
Opera Solutions works with large to mid-sized companies and has been growing rapidly. It raised $84 million in funding in late 2011, and another $30 million in May 2013. Today the company is valued at $500 million.

3. Actian
Based in Redwood City, CA, Actian provides a data analytics platform. It recently acquired another data company, ParAccel.

4. Fractal Analytics
The company was founded in 2000 and is headquartered in San Mateo, with a significant operational presence in India. It provides business intelligence and analytics services for the financial services, insurance, and telecommunications industries.

5. Grok
They provide asset (equipment) management and predictive modeling for manufacturers. The company was founded in 2005 and is based in Redwood City.

6. Alpine Data Labs
They offer in-house consulting and 24-hour initial delivery. The company raised $7.5 million in Series A funding in May 2011 and is based in San Mateo.

7. Skytree
The company focuses specifically on machine learning. It raised $18 million in Series A funding in April 2013, after starting in early 2012 (with $1.5 million in seed funding). The company is based in San Jose.

Jul 26, 2013

Learning to rank and recommender systems

A recommendation problem is essentially a ranking problem: Among a list of movies, which should rank higher in order to be recommended? Among the job candidates, who should LinkedIn display to the recruiters?  The task of recommendation can be viewed as creating a ranked list.

The classical approach to recommender systems is based on collaborative filtering, which uses similar users or similar items to make recommendations. Collaborative filtering was popularized by the Netflix contest from 2006 to 2009, when many teams around the world competed to create movie recommendations based on movie ratings provided by Netflix.
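As an illustration, here is a minimal item-based collaborative filtering sketch. The ratings matrix is made-up toy data (rows are users, columns are items, 0 means "not rated"); a real system would use a large, sparse table.

```python
import numpy as np

# Toy user-item rating matrix (illustration data only).
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

def cosine_sim(a, b):
    """Cosine similarity between two item rating vectors."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return a @ b / denom if denom else 0.0

def predict(user, item):
    """Predict a rating as a similarity-weighted average of the
    user's ratings on the other items."""
    num = den = 0.0
    for other in range(ratings.shape[1]):
        if other == item or ratings[user, other] == 0:
            continue
        s = cosine_sim(ratings[:, item], ratings[:, other])
        num += s * ratings[user, other]
        den += abs(s)
    return num / den if den else 0.0

print(round(predict(0, 2), 2))  # user 0's predicted rating for item 2
```

Because user 0 liked items similar to items 0 and 1 but not item 3, the predicted rating for item 2 (which resembles item 3) comes out low.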

While collaborative filtering has achieved a certain degree of success, it has its limitations. The fundamental problem is the limited information captured in the user-item table. Each cell of this table is either a rating or some aggregated activity score (such as a purchase) on an item from a specific user. Complex information such as user browsing time, clicks, or external events is hard to capture in such a table format.

A ranking approach to recommendation is much more flexible. It can incorporate all of this information as different variables (features), and is thus more explicit. In addition, we can combine ranking with machine learning, allowing the ranking function to evolve over time based on data.

In the traditional approach to ranking, a ranking score is generated by fixed rules. For example, a page’s score depends on links pointing to that page, its text content, and its relevance to search keywords. Other information such as the visitor’s location or the time of day could all be part of the ranking formula. In this formula, the variables and their weights are pre-defined.

The idea of Learning to Rank is using actual user data to create a ranking function. The machine learning procedure for ranking has the following steps:
  1. Gather training data based on click information.
  2. Gather all attributes about each data point, such as item information, user information, time of day etc. 
  3. Create a training dataset that has 2 classes: positive (Click) and negative (no click).
  4. Apply a supervised machine learning algorithm (such as logistic regression) to the training data.
  5. The learned model is our ranking model. 
  6. For any new data point, the ranking model assigns a probability score between 0 and 1 on whether the item will be clicked (selected). We call this probability score our “ranking score”. 
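The steps above can be sketched with scikit-learn. The features (item popularity, user-item affinity, hour of day) and click labels below are synthetic illustrations, not real log data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Steps 1-3: each row is a (item popularity, user-item affinity, hour)
# feature vector; the label is 1 for a click, 0 for no click.
X = rng.random((500, 3))
y = (0.7 * X[:, 0] + 0.3 * X[:, 1] + 0.1 * rng.random(500) > 0.5).astype(int)

# Step 4: fit a supervised model on the click data.
model = LogisticRegression().fit(X, y)

# Steps 5-6: the predicted click probability is the ranking score.
candidates = rng.random((5, 3))
scores = model.predict_proba(candidates)[:, 1]
ranked = np.argsort(-scores)  # best candidate first
print(ranked)
```

Retraining is just a matter of refitting the model on fresh click logs, which is what makes this approach adaptive.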

The training data are constructed from user visit logs containing user clicks, or may be prepared manually by human raters.

The learning-to-rank approach to recommendation has been adopted by Netflix and LinkedIn. It is fast and can be retrained repeatedly. It is behind those wonderful movie recommendations and connection recommendations we enjoy on these sites.

Jul 9, 2013

Text Mining: Named Entity Detection (2)

In the machine learning approach for detecting named entities, we first create labeled data, which are sentences with each word tagged. The tags correspond to the entities of interest. For example,
May      Smith    was   born   in    May.
Person   Person                      Date
Since an entity may have several words, we can use finer tags in IOB (in-out-begin) format, which indicates whether the word is the beginning (B) or inside (I) of a phrase. We will have tags for B-Person, I-Person, B-Date, I-Date and O (other).  With such tagging, the above example becomes
May          Smith      was  born  in    May.
B-Person  I-Person   O     O    O   B-Date

Our training data have every word tagged. Our goal is to learn this mapping and apply it to a new sentence. In other words, we want to find a mapping from a word (given its context) to an entity label.

Each word can be represented as a set of features. A machine learning model maps input features to an output (entity class). Such features could include:
The word itself (word identity)
The word's position from the beginning
The word's position from the end
The word before
The word after
Whether the word is capitalized

Our training data would look like the following table, where each row is a data point:

word identity | position from the beginning | position to the end | word before | word after | is capitalized?

Once we have created training data like the above table, we can train a standard machine learning method such as an SVM, decision tree or logistic regression to generate a model (which contains machine-generated rules). We can then apply this model to any new sentence and detect entities of the types "Person" and "Date".
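A toy sketch of this training loop with scikit-learn is below. The two hand-tagged sentences are illustration data; a real system would use thousands of labeled sentences and many more features.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny hand-labeled corpus in IOB format (illustration only).
tagged = [
    [("May", "B-Person"), ("Smith", "I-Person"), ("was", "O"),
     ("born", "O"), ("in", "O"), ("May", "B-Date")],
    [("John", "B-Person"), ("arrived", "O"), ("in", "O"),
     ("June", "B-Date")],
]

def features(sent, i):
    """Features for the i-th word: identity, position, neighbors, case."""
    word = sent[i][0]
    return {
        "word": word,
        "pos_from_start": i,
        "pos_from_end": len(sent) - 1 - i,
        "prev": sent[i - 1][0] if i > 0 else "<S>",
        "next": sent[i + 1][0] if i < len(sent) - 1 else "</S>",
        "capitalized": word[0].isupper(),
    }

X_dicts, y = [], []
for sent in tagged:
    for i in range(len(sent)):
        X_dicts.append(features(sent, i))
        y.append(sent[i][1])

vec = DictVectorizer()
model = LogisticRegression(max_iter=1000).fit(vec.fit_transform(X_dicts), y)

# Tag a new sentence with the learned model.
new = [("Mary", "?"), ("was", "?"), ("born", "?"), ("in", "?"), ("June", "?")]
new_X = vec.transform([features(new, i) for i in range(len(new))])
print(list(model.predict(new_X)))
```

The `DictVectorizer` turns each feature dictionary into a numeric vector, so any standard classifier can be plugged in.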

Note that we can only detect entity types contained in our training data. While this may seem like a limitation, within those types we can still discover new instances or values that were never seen before.

The approach we have discussed so far is a supervised learning approach, which depends heavily on human-labeled data. When manual labels are hard to come by, we can find ways to create training data by machine. Such an approach is called semi-supervised learning, and it holds a lot of promise as data get large.

Jul 8, 2013

Text Mining: Named Entity Detection

An interesting task in text mining is detecting entities in text. Such an entity could be a person, a company, a product, or a location. Since an entity is associated with a special name, it is also called a named entity. For example, the following text contains 3 named entities:
       Apple has hired Paul Deneve as vice president, reporting to CEO Tim Cook.
The first term “Apple” indicates a company, and the second and third are persons.

Named entity detection (NER) is an important component of social media analysis; it helps us understand user sentiment on specific products. NER is also important for product search at e-commerce companies, where it helps us understand user search queries related to certain products.

To map each name to an entity, one solution is to use a dictionary of special names. Unfortunately, this approach has two serious problems. The first is that no dictionary is complete: new companies are created and new products are launched every day, and it is hard to keep track of all the new names. The second is the ambiguity of associating a name with an entity. The following example illustrates this:
As Washington politicians argue about the budget reform, it is a good time to look back at George Washington’s time.
In this text, the first mention of “Washington” refers to a city, while the second mention refers to a person. The distinction of these two entities comes from their context.

To resolve ambiguity in entity mapping,  we can create certain rules to utilize the context. For example, we can create the following rules:
  1. When ‘Washington’ is followed by ‘politician’, then it refers to a city.
  2. When ‘Washington’ is preceded by ‘in’, then it refers to a city.
  3. When ‘Washington’ is preceded by ‘George’, then it refers to a person.
But there could be too many such rules. For example, each of the following phrases would generate a different rule: “Washington mentality”, “Washington atmosphere”, “Washington debate”, as well as “Washington biography” and “Washington example”. The richness of natural language makes the number of rules explode, and the rules remain susceptible to exceptions.
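A minimal sketch of these hand-written context rules makes the scaling problem concrete: every phrase not covered by a rule falls through.

```python
def classify_washington(words, i):
    """Classify the mention of 'Washington' at position i using
    the three hand-written context rules above."""
    prev = words[i - 1].lower() if i > 0 else ""
    nxt = words[i + 1].lower() if i < len(words) - 1 else ""
    if prev == "george":
        return "Person"       # rule 3
    if prev == "in" or nxt in ("politician", "politicians"):
        return "City"         # rules 1 and 2
    return "Unknown"          # every uncovered phrase falls through

words = "As Washington politicians argue , remember George Washington .".split()
print([classify_washington(words, i)
       for i, w in enumerate(words) if w == "Washington"])
# → ['City', 'Person']
```

"Washington mentality" or "Washington debate" would hit the `Unknown` branch, so each new phrase demands yet another rule.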

Instead of manually creating rules, we can apply machine learning. The advantage of machine learning is that it creates patterns automatically from examples; no rule needs to be written by hand. The machine learning algorithm takes a set of training examples and churns out its own model (comparable to a set of rules). If we get new training data, we can retrain the algorithm and generate a new model quickly.

How does the machine learning approach work? I will discuss it in the next post. 

Apr 30, 2013

The future of video mining

IP cameras have become inexpensive and ubiquitous. For $149, you can get an IP camera to use at home. This camera saves recordings in the cloud, so you never have to worry about buying memory cards. Dropcam offers this amazing service. The camera connects to your Wi-Fi network and streams real-time video 24 hours a day to the server. You can watch your home remotely from your smartphone or laptop.

Dropcam is a startup based in San Francisco. Founded in 2009, the company has raised 2 rounds of funding, with a $12 million Series B in June 2012. As a young startup, the company has seen rapid user growth. Dropcam CEO Greg Duffy said that Dropcam cameras now upload more video per day than YouTube.

What makes Dropcam unique among IP camera companies is its video mining capability. If you want to review your video from the last 7 days, you don’t have to watch it from beginning to end. Dropcam automatically marks the video segments with motion, so you can jump to those segments right away.

In addition, Dropcam plans to implement face and figure detection so that it can offer more intelligent viewing options. Furthermore, the video viewing software could detect events such as a cat running, a dinner gathering, and so on. The potential of video mining is limitless. Given our limited time to review videos spanning many hours (or days and weeks), and the continuing accumulation of daily recordings, the need for video mining will keep growing.

Apr 29, 2013

Stroke Prediction

Stroke is the third leading cause of death in the United States. It is also the principal cause of serious long-term disability. Stroke risk prediction can contribute significantly to its prevention and early treatment. Numerous medical studies and data analyses have been conducted to identify effective predictors of stroke.

Traditional studies adopted features (risk factors) that were verified by clinical trials or selected manually by medical experts. For example, one famous study by Lumley and others [1] built a 5-year stroke prediction model using a set of 16 manually selected features. However, these manually selected features could miss some important indicators. For example, past studies have shown that there exist additional risk factors for stroke, such as creatinine level, time to walk 15 feet, and others.

The Framingham Study [2] surveyed a wide range of stroke risk factors including blood pressure, the use of anti-hypertensive therapy, diabetes mellitus, cigarette smoking, prior cardiovascular disease, and atrial fibrillation. With the large number of features in current medical datasets, it is a cumbersome task to identify and verify each risk factor manually. Machine learning algorithms can efficiently identify features highly related to stroke occurrence from a huge set of features. By doing so, they can improve the accuracy of stroke risk prediction, in addition to discovering new risk factors.

In a study by Khosla and others [3], a machine-learning-based predictive model was built on stroke data, and several feature selection methods were investigated. Their model was based on automatically selected features and outperformed existing stroke models. In addition, they were able to identify risk factors that had not been discovered by traditional approaches. The newly identified factors include:

Total medications
Any ECG abnormality
Min. ankle arm ratio
Maximal inflation level
Calculated 100 point score
General health
Mini-mental score 35 point

It’s exciting to see machine learning play a more important role in medicine and health management.


[1]  T.  Lumley, R. A. Kronmal, M. Cushman, T. A. Manolio, and S. Goldstein. A stroke prediction score in the elderly: Validation and web-based application. Journal of Clinical Epidemiology, 55(2):129–136, February 2002.
[2] P. A. Wolf, R. B. D'Agostino, A. J. Belanger, and W. B. Kannel. Probability of stroke: A risk profile from the Framingham study. Stroke, 22:312–318, March 1991.
[3] Aditya Khosla, Yu Cao, Cliff Chiung-Yu Lin, Hsu-Kuang Chiu, Junling Hu, and Honglak Lee. "An integrated machine learning approach to stroke prediction." In Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 183-192. ACM, 2010.

Apr 26, 2013

History of machine learning

The development of machine learning is an integral part of the development of artificial intelligence. In the early days of AI, people were interested in building machines that mimic human brains. The perceptron model was invented in 1957, and it generated an overly optimistic view of AI during the 1960s. After Marvin Minsky pointed out the limitations of this model in expressing complex functions, researchers stopped pursuing it for the next decade.

In the 1970s, the machine learning field was dormant, as expert systems became the mainstream approach in AI. The revival of machine learning came in the mid-1980s, when the decision tree model was invented and distributed as software. The model can be inspected by a human and is easy to explain. It is also very versatile and can adapt to widely different problems. It was also in the mid-1980s that multi-layer neural networks were invented. With enough hidden units, a neural network can approximate any function, thus overcoming the limitation of the perceptron, and the neural network field saw a revival.

Both decision trees and neural networks found wide use in financial applications such as loan approval, fraud detection and portfolio management. They were also applied to a wide range of industrial processes and to post office automation (address recognition).
Machine learning saw rapid growth in the 1990s, due to the invention of the World Wide Web and the large amounts of data gathered on the Internet. The fast interaction on the Internet called for more automation and more adaptivity in computer systems. Around 1995, SVM was proposed and was quickly adopted. SVM packages like libSVM and SVMlight made it a popular method to use.

After 2000, logistic regression was rediscovered and redesigned for large-scale machine learning problems. In the ten years following 2003, logistic regression attracted a lot of research work and became a practical algorithm in many large-scale commercial systems, particularly at large Internet companies.

We have discussed the development of four major machine learning methods. Other methods were developed in parallel but see declining use in the field today: Naive Bayes, Bayesian networks, and the maximum entropy classifier (mostly used in natural language processing).

In addition to the individual methods, we have seen the invention of ensemble learning, where several classifiers are used together, and its wide adoption today. 

New machine learning methods are still invented each day. For the newest development, please check out the annual ICML (International Conference on Machine Learning) conference. 

Apr 25, 2013

Yelp and Big data

In the Big Data Gurus meetup yesterday, hosted in Samsung R&D center in San Jose, Jimmy Retzlaff from Yelp gave a talk on Big data at Yelp.

By the end of March 2013, Yelp had 36 million user reviews. These reviews cover everything from restaurants to hair salons and other local businesses. The number of reviews on the Yelp website has grown exponentially in the last few years.

Yelp also sees high traffic now. In January 2013, there were 100 million unique visitors to Yelp. The website records 2 terabytes of log data and another 2 terabytes of derived logs every day. While this data size is still small compared to eBay or LinkedIn, it calls for the implementation of big data infrastructure and data mining methods.

Yelp uses MapReduce extensively and builds its infrastructure on Amazon cloud.

Yelp’s log data contain ad displays, user clicks and so on. Data mining helps Yelp design its search system, show ads, and filter fake reviews. In addition, data mining enables products such as "review highlights" and "people who viewed this also viewed...".

Yelp is one example of a company starting to tackle big data and taking advantage of data mining to create better services.

Apr 24, 2013

Machine Learning for Anti-virus Software

Symantec is the largest anti-virus software vendor. It has 120 million subscribers, who visit 2 billion websites a day and generate 700 billion submissions. Given such a large amount of data, it is paramount that anti-virus software detect viruses quickly and accurately.

Anti-virus software was originally built manually: security experts reviewed each piece of malware and constructed its “signature”, and each computer file was checked against these signatures. Given the rapid change of malware and its many variations, there are not enough human experts to generate all the exact signatures. This gave rise to heuristic or generic signatures, which can handle more variations of the same file. However, new types of malware are created every day, so we need a more adaptive approach that identifies malware automatically (without the manual effort of creating signatures). This is where machine learning can help.

Computer viruses have come a long way. The first virus, “Creeper”, appeared in 1971. Then came Rabbit (or Wabbit), and after that computer worms like “Love Letter” and Nimda. Today's malware is much more sophisticated: it evolves much faster and is constantly changing. Virus creation is now funded by organizations and some governments. There are big incentives to steal users' financial information or companies’ trade secrets. In addition, malware enables certain governments to conduct spying or potential cyber war on their targets.

Symantec uses about 500 features for their machine learning model. The feature values can be continuous or discrete. Such features include:
How did it come to this machine (through browser, email, ..)?
How many other files are on this machine?
How many clean files are on this machine?
Is the file packed or obfuscated (mutated)?
Does it write or communicate?
How often does it run?
Who runs it?

Researchers at Symantec experiment with SVM, decision tree and linear regression models.

In building a classifier, they are not simply optimizing accuracy or the true positive rate. They are also concerned with false positives, where a benign program is classified as malware. Such false positive predictions can be very costly for users. Balancing true positives against false positives leads to using the ROC (Receiver Operating Characteristic) curve.

An ROC curve plots the trade-off between the true positive rate and the false positive rate. Each point on the curve corresponds to a cutoff we choose, and they use the ROC curve to select a target point. Below is an illustration of the trade-off.

The chart above suggests that when we aim for a 90% true positive rate, we will have a 20% false positive rate; when we only aim for an 80% true positive rate, the false positive rate is reduced. (A better classifier would shift the ROC curve up, so that we achieve a higher true positive rate at any given false positive rate.)
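Reading a trade-off point off an ROC curve can be sketched with scikit-learn. The malware and benign scores below are synthetic illustrations, not Symantec's data.

```python
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(1)
y_true = np.r_[np.ones(100), np.zeros(100)]    # 1 = malware, 0 = benign
scores = np.r_[rng.normal(1.0, 1.0, 100),      # malware scores run higher
               rng.normal(0.0, 1.0, 100)]

fpr, tpr, thresholds = roc_curve(y_true, scores)

# Pick the first cutoff that reaches at least an 80% true positive
# rate, and see what false positive rate it costs.
i = np.argmax(tpr >= 0.8)
print(f"TPR={tpr[i]:.2f} at FPR={fpr[i]:.2f}, cutoff={thresholds[i]:.2f}")
```

Sliding the cutoff along the `thresholds` array traces out exactly the trade-off the chart describes.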

According to their researchers, Symantec has achieved a high accuracy rate (the average of the true positive and true negative rates) of 95%. Its true positive rate is above 98% and its false positive rate is below 1%.

I am a user of Norton software (by Symantec) and enjoy it. I hope to see more success from Symantec, and that we win the war against malware!

Mar 12, 2013

Basic steps of applying machine learning methods

Deploying a machine learning model typically takes the following five steps:

1. Data collection.
2. Data preprocessing:
    1) Data cleaning;
    2) Data transformation;
    3) Divide data into training and testing sets.
3. Build a model on training data.
4. Evaluate the model on the test data.
5. If the performance is satisfying, deploy to the real system.

This process can be iterative, meaning we can re-start from step 1 again. For example, after a model is deployed, we can collect new data and repeat this process. Let’s look at the details of each step:

1. Data Collection:  
        At this stage, we want to collect all relevant data. For an online business, user clicks, search queries, and browsing information should all be captured and saved into the database.
In manufacturing, log data capture machine status and activities. Such data are used to produce maintenance schedules and predict required replacement parts.

2. Data Preprocessing:
    1) Data Cleaning:
    The data used in machine learning describe factors, attributes, or features of an observation. A simple first step in looking at the data is finding missing values. What is the significance of a missing value? Would replacing a missing value with the median value for that feature be acceptable? For example, perhaps the person filling out a questionnaire doesn't want to reveal his salary. This could be because the person has a very low salary or a very high salary. In this case, using other features to predict the missing salary data might be appropriate; one might infer the salary from the person’s zip code. The fact that the value is missing may itself be important. There are machine learning methods that ignore missing values, and one of these could be used for this data set.
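The missing-salary choices discussed above can be sketched with pandas. The questionnaire data are made up for illustration.

```python
import pandas as pd

# Made-up questionnaire data with one missing salary.
df = pd.DataFrame({
    "zipcode": ["94301", "94301", "10001", "10001"],
    "salary":  [120_000, None, 60_000, 65_000],
})

# Option 1: fill with the overall median.
df["salary_median"] = df["salary"].fillna(df["salary"].median())

# Option 2: infer from a related feature -- here, the zip code.
df["salary_by_zip"] = df["salary"].fillna(
    df.groupby("zipcode")["salary"].transform("median"))

# Option 3: keep a flag, since missingness itself may be informative.
df["salary_missing"] = df["salary"].isna()
print(df)
```

The two imputation options disagree sharply here (an overall median of 65,000 versus a zip-code median of 120,000), which is exactly why the choice deserves thought.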
          2) Data Transformation:
          In general we work with both numerical and categorical data. Numerical data consist of actual numbers, while categorical data take a few discrete values. Examples of categorical data include eye color, species type, marital status, or gender. A zip code is actually categorical: it is a number, but there is no meaning to adding two zip codes. There may or may not be an order to categorical data. For instance, good, better, best is categorical data with an order.
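A small pandas sketch of transforming these categorical examples: one-hot encoding for unordered categories, integer codes for ordered ones. The data are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "eye_color": ["brown", "blue", "green"],
    "quality":   ["good", "best", "better"],
})

# Unordered categories: one-hot encode (a zip code would be treated
# the same way, since adding zip codes is meaningless).
onehot = pd.get_dummies(df["eye_color"], prefix="eye")

# Ordered categories: map to integers that preserve the order.
order = {"good": 0, "better": 1, "best": 2}
df["quality_code"] = df["quality"].map(order)
print(pd.concat([onehot, df["quality_code"]], axis=1))
```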

3) After the data have been cleaned and transformed, they need to be split into a training set and a test set.
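The split itself is a one-liner with scikit-learn's helper; the observations below are placeholders.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # 10 made-up observations
y = np.arange(10) % 2              # made-up binary targets

# Hold out 30% of the data so the evaluation stays honest.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
print(len(X_train), len(X_test))   # prints: 7 3
```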

3. Model Building:  
        The training data set is used to create the model, which is then used to predict the answers for new cases where the answer (target) is unknown. For example, Section 1.3 describes how a decision tree is built using the training data set. Several different modeling techniques have been introduced and will be discussed in detail in future sections. Various models can be built using the same training data set.

4. Model Evaluation
         Once the model is built with the training data, it is used to predict the targets for the test data. First the target values are removed from the test data set. The model is applied to the test data set to predict the target values, and the predicted values are then compared with the actual target values. The accuracy of the model is the percentage of correct predictions made. These accuracies can be used to compare different models. Several other ways to compare model accuracy are discussed in the next section on performance evaluation.
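The evaluation step can be sketched end to end with scikit-learn: train on the training split, predict the hidden test targets, and compare accuracy across candidate models. The dataset is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Synthetic classification data standing in for a real dataset.
X, y = make_classification(n_samples=300, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

for model in (LogisticRegression(max_iter=1000), DecisionTreeClassifier()):
    preds = model.fit(X_train, y_train).predict(X_test)
    acc = accuracy_score(y_test, preds)  # fraction of correct predictions
    print(type(model).__name__, round(acc, 3))
```

The same held-out test set scores every candidate model, which is what makes the comparison fair.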

5. Model Deployment:
        This is the most important step. If the speed and accuracy of the model are acceptable, the model should be deployed in the real system. The model used in production should be built with all the available data, since models improve with the amount of data used to create them. The results of the model need to be incorporated into the business strategy. Data mining models provide valuable information that gives companies great advantages. Obama won the election in part by incorporating data mining results into his campaign strategy. The last chapter of this book provides information on how a company can incorporate data mining results into its daily business.

Feb 21, 2013

Data Mining and Neuroscience

Bradley Voytek from the University of California, San Francisco gave a talk on data mining and neuroscience yesterday in Mountain View, as part of the Big Data Think Tank meetup.

Voytek said data mining could play an important role in brain study and treatment. Imagine applying data mining to reduce open-skull surgery from 45 minutes to 2 minutes, or to better analyze and understand fMRI data. Imagine applying data mining to understand aging and brain response, helping us identify ways to improve cognitive function in the elderly. In addition, though not mentioned in this talk, there are recent studies applying data mining (specifically association rule mining) to understand Alzheimer’s disease, identifying associations in brain region changes.

Voytek also mentioned the need to understand the network of neural signals. This is an exciting domain, as it will ultimately help us improve human cognition, such as hearing and vision. As reported recently in the news, the implementation of “bionic eyes” depends on mapping the neurons responsible for visual processing. A deeper understanding of the neurons for this function could help us eradicate blindness completely. Imagine a future with no more blindness or deafness. How much human suffering could be eliminated!

Neuroscience is the next frontier of science. Understanding the human brain, and ultimately human consciousness, would resolve age-old questions about self and soul. With that understanding, imagine we could preserve consciousness or even human memory. It is not far-fetched to think we could watch another person’s memory like watching a movie, if we could truly understand the neural signals associated with memory retrieval.

It is exciting to see data mining play an important role in the advancement of science, particularly in neuroscience.

Feb 11, 2013

Data Mining vs. Machine Learning

Data mining and machine learning used to be like two cousins: they had different parents, but now they grow increasingly alike, almost like twins. Many people even refer to data mining as machine learning.

The field of machine learning grew out of the effort to build artificial intelligence. Its major concern is making a machine learn and adapt to new information. The origin of machine learning can be traced back to 1957, when the perceptron model was invented, modeled after neurons in the human brain. That prompted the development of the neural network model, which flourished in the late 1980s. From the 1980s to the 1990s, the decision tree method became very popular, owing to the efficient C4.5 package. SVM was invented in the mid-1990s and has since been widely used in industry. Logistic regression, an old method in statistics, has seen growing adoption in machine learning since 2001, when the book The Elements of Statistical Learning was published.

The field of data mining grew out of knowledge discovery in databases. In 1993, a seminal paper by Rakesh Agrawal and two others proposed an efficient algorithm for mining association rules in large databases. This paper prompted many research papers on discovering frequent patterns and more efficient mining algorithms. The early data mining work of the 1990s was linked to creating better SQL statements and working with databases directly.

Data mining has a strong focus on industrial problems and practical solutions. Therefore it is concerned not only with data size (large data), but also with data processing speed (stream data). In addition, personalized recommender systems and network mining were developed due to business needs, outside the machine learning field.

The two major conferences for data mining are KDD (Knowledge Discovery and Data Mining) and ICDM (International Conference on Data Mining). The two major conferences for machine learning are ICML (International Conference on Machine Learning) and NIPS (Neural Information Processing Systems). Machine learning researchers attend both types of conferences. However, the data mining conferences have much stronger industry links.

Data miners typically have a strong foundation in machine learning, but also a keen interest in applying it to large-scale problems.

Over time, we will see deeper connection between data mining and machine learning. Could they become twins one day? Only time will tell. 

Feb 10, 2013

The future of genome data mining

23andMe is a startup based in Mountain View, California. Founded in 2006, its core business is genome sequencing for individuals, providing additional information on your ancestry and possible disease risks, which you can access on their website.

The cost of sequencing a person’s genome used to be prohibitive. However, 23andMe, with its deep pockets of venture and personal funding (co-founder Anne Wojcicki is the wife of Google co-founder Sergey Brin), was able to cut the sequencing price from $999 in 2007 to $399 in 2008, then to $299 until the end of 2012. In December 2012, with $50 million in Series D funding, 23andMe slashed the price to $99 per person. This price is probably below their actual testing cost. Why the price cut? 23andMe states that its goal is to get 1 million people to participate.

What is the drive behind the large expansion of the user base? The first potential is disease discovery. With a large population, a disease can be linked to genome data more solidly. Suppose we find a gene mutation in 1 diabetes patient; that is not enough to conclude that the mutation caused her diabetes. However, if we find the same gene mutation in 1,000 diabetes patients, we can be much more confident in drawing this conclusion. Ultimately it is about getting a large enough population sample so that we can uniquely link a segment of gene mutation, or ancestral traits, to a disease.

By December 2012 (before the price cut), 23andMe had accumulated 180,000 individual genome profiles [1]. So far, this is the largest dataset any single organization has accumulated on human genomes. Combined with the self-reported health profiles of these customers, studies linking diseases to gene patterns can be done more conclusively.

23andMe has partnered with Genentech to study a range of diseases, from Alzheimer's to breast cancer and, most recently, response to the drug Avastin. In addition, the company received a small grant from the NIH to study allergies and asthma. Given the large population of genome data, we could see some exciting discoveries.

Data mining will play a big role in these new discoveries. Not only does data mining enable pattern discovery in large data where there are many different diseases and personal traits, it can also create predictive models of disease onset based on a person's genome profile. Feature selection techniques from data mining have also worked well in genome studies, where there are more than 20,000 gene features but only a few data points. Even with 1 million people in the data, the small-sample problem could still exist when only a small group of people share a similar disease. (Thus it is important to get data from even more people, ideally tens of millions or even billions.)
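A minimal sketch of filter-style feature selection in this "many features, few samples" regime. Everything here is synthetic: the cohort size, the 20,000 binary gene features, and the three "causal" genes are made up to illustrate the technique:

```python
import random

random.seed(0)

# Hypothetical toy cohort: 40 patients, 20,000 binary gene features (mutated or not).
# Disease status is driven by three arbitrarily chosen "causal" genes.
n_patients, n_genes = 40, 20_000
causal = {7, 123, 4567}
X = [[random.randint(0, 1) for _ in range(n_genes)] for _ in range(n_patients)]
y = [int(sum(row[g] for g in causal) >= 2) for row in X]

def score_gene(g):
    """Univariate filter score: |P(mutated | sick) - P(mutated | healthy)|."""
    sick = [X[i][g] for i in range(n_patients) if y[i] == 1]
    well = [X[i][g] for i in range(n_patients) if y[i] == 0]
    if not sick or not well:
        return 0.0
    return abs(sum(sick) / len(sick) - sum(well) / len(well))

# Keep only the 10 highest-scoring genes before any model fitting.
# With 40 patients and 20,000 features, spurious genes can outrank the causal
# ones -- exactly the small-sample problem discussed above.
top10 = sorted(range(n_genes), key=score_gene, reverse=True)[:10]
print(top10)
```

Real genome studies use more careful statistics (multiple-testing correction in particular), but the shape of the computation is the same: score each gene against the outcome, then keep the top scorers.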

The future of genome study is closely linked to data mining. This is an exciting time to be a data miner.

[1] 23andMe press release, “23andMe Raises More Than $50 Million in New Financing”, December 11, 2012.

Feb 9, 2013

Graph Mining

With the rapid rise of social networks (through Facebook), professional networks (through LinkedIn), and short-news networks (through Twitter), computer scientists have taken a keen interest in network mining. In the terminology of computer science, a network is a graph, defined as a collection of nodes and edges. Therefore network mining is also called graph mining.

The most successful application of graph mining is web search, where the Internet is modeled as a network and webpages are nodes. Each webpage is ranked based on the strength of its links.

Other applications of graph mining include:

1. Understanding molecule structures for drug discovery [1]

2. Predicting the spread of infectious diseases.

Any social network can be modeled as a graph, where nodes are people and edges are their relationships. On Facebook, the edges are "friend". On Twitter, the edges become "followed by". On LinkedIn, the edges are "connection". On eBay, the edges can be "sold to".

How do we make use of the network structure? For marketers, finding the top influencers in a social network can be very useful. These people influence many others with their opinions. Marketing can be much more effective by focusing on these influencers.

How do we discover influencers? On Facebook, these are people who have many friends and whose postings get many comments. On Twitter, these people have many followers and their tweets are retweeted often. Note that it is possible for a top influencer to have a small number of friends (or followers), as long as those friends (or followers) are themselves top influencers. Such a person could be a "king maker", who directly influences the most powerful or influential politicians.

Mining top influencers therefore involves an algorithm like PageRank, which has been used successfully to discover top web pages. The essence of the PageRank algorithm is the recursive calculation of the weights on each link. This can be applied to finding top influencers, whose influence strength is recursively based on their followers' influence levels.

Given 100 million people in a network, mining top influencers is a computational challenge. Fortunately such computation can be parallelized, as each node can be calculated simultaneously. This is why Google invented MapReduce, and how Hadoop came to be. Essentially, the so-called "Big Data" is about providing parallel computing infrastructure (such as Hadoop). Graph mining pioneered big data computing.

[1] Takigawa, Ichigaku, and Hiroshi Mamitsuka. "Graph mining: procedure, application to drug discovery and recent advances." Drug Discovery Today, Volume 18, Issues 1–2, January 2013, Pages 50–57

Jan 10, 2013

Smart TV and Data Mining

Smart TVs are coming. At this year's CES, the largest annual consumer electronics show, held in Las Vegas, every TV manufacturer is bragging about the "smartness" of its TVs. All the TVs now have Internet connectivity and cameras, display content from your smartphone, and play movies from YouTube and Netflix. Even more, smart TVs connect to Facebook, let you make Skype calls, and answer your voice commands like Siri does on the iPhone.

The most interesting function for me is the capability of making movie recommendations. This means every TV has to capture and remember the user’s viewing history, possibly combined with the server’s knowledge about other viewers’ viewing data. All this information will be sent to and stored on the TV maker's servers. They will crunch the data and build a movie recommendation engine exactly as Netflix has done.

This is a big advance for data mining. It is possible only now that TVs are connected to the Internet and store data in the cloud. A traditional TV has neither the computing power nor enough memory to do data mining.

Now we are seeing data mining coming to our home. 

Jan 9, 2013

Product Recommendation by Amazon

Amazon is quite secretive about its technology. While researchers from other companies publish many papers on their approaches, we seldom see papers from Amazon. However, we can still infer its technology by looking at its products in action. In this blog post, we take a look at how Amazon makes recommendations to its users.

Users who purchase on Amazon typically get the following two "recommendations" at the bottom of a product page: (1) "Frequently bought together" and (2) "Customers who bought this item also bought".

Strictly speaking, showing what is frequently bought together does not require a complex recommender system. It is a simple counting of frequent itemsets, a fundamental technique taught on the first day of a data mining class, also called frequent pattern mining. The key challenge here is computing all of those "bought together" sets quickly from billions of transactions.
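The counting itself really is simple; the hard part at Amazon's scale is doing it over billions of orders. Here is a toy single-machine version with made-up orders and product names:

```python
from collections import Counter
from itertools import combinations

# Toy order history: each order is the set of products bought together.
orders = [
    {"kindle", "case", "charger"},
    {"kindle", "case"},
    {"kindle", "charger"},
    {"case", "charger"},
]

# One pass over the orders, counting every unordered pair of co-purchased items.
pair_counts = Counter()
for order in orders:
    for pair in combinations(sorted(order), 2):
        pair_counts[pair] += 1

def bought_together(item, k=2):
    """'Frequently bought together' for one product: its top co-purchase partners."""
    partners = Counter()
    for (a, b), c in pair_counts.items():
        if item == a:
            partners[b] += c
        elif item == b:
            partners[a] += c
    return [p for p, _ in partners.most_common(k)]

print(bought_together("kindle"))
```

At production scale, the same pair counting becomes a distributed aggregation (the canonical MapReduce word-count pattern, applied to item pairs instead of words).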

However, Amazon does provide more personalized recommendations, similar to what Netflix does. After you log in, there is a "recommended for you" page where Amazon's recommendation engine is in full action.

How does Amazon make these recommendations? Based on a 2003 article in IEEE Internet Computing (Jan/Feb issue), the company uses item-item similarity methods from collaborative filtering. At the time, this was the state-of-the-art method, and Amazon was pioneering the field of recommender systems.

It seems this same implementation has been in use at Amazon until today, as we can see from the picture to the left (snapshot taken in January 2013). The first product (Girl’s 7-16 Jacket) is recommended because the user purchased a somewhat similar item (Girls 2-6x Princess Jacket). The same is true for the second recommended product. In other words, item-item similarity is a major technology in Amazon's recommendation engine.
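The core of item-item collaborative filtering is easy to sketch: represent each item by the vector of users who bought it, and call two items similar when those vectors point the same way. The purchase data and item names below are made up (Amazon's actual implementation, per the 2003 article, is far more engineered):

```python
from math import sqrt

# Hypothetical purchase history: user -> set of items bought.
purchases = {
    "u1": {"princess_jacket", "girls_jacket"},
    "u2": {"princess_jacket", "girls_jacket", "boots"},
    "u3": {"princess_jacket", "girls_jacket", "boots"},
    "u4": {"tent"},
}

def item_vector(item):
    """Binary vector over users: 1 if the user bought the item."""
    return [1 if item in bought else 0 for bought in purchases.values()]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = sqrt(sum(x * x for x in a)), sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def similar_items(item, items):
    """Rank the other items by cosine similarity of their purchase vectors."""
    others = [i for i in items if i != item]
    return sorted(others,
                  key=lambda i: cosine(item_vector(item), item_vector(i)),
                  reverse=True)

items = {"princess_jacket", "girls_jacket", "boots", "tent"}
print(similar_items("princess_jacket", items))
```

Items bought by the same users rank high; the tent, with no shared buyers, ranks last. The similarity table can be precomputed offline, which is what makes this method fast at serving time.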

The field of recommender systems has seen great advances since the 2006 Netflix Prize contest. From 2006 until 2009, when the prize was awarded, many methods were invented to tackle the recommendation problem. Among them, the most widely adopted method today is matrix factorization (with SVD as a special implementation). It was shown to generate better results than the item-item similarity approach. Netflix adopted the matrix factorization method after 2007 and has been using it in its production system to this day.
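The matrix factorization idea can be sketched in plain Python: approximate the sparse rating matrix as a product of low-rank user and item factor matrices, fit only on the observed entries. The ratings, latent dimension, and learning rate below are made up for illustration; production systems use regularized, parallelized variants:

```python
import random

random.seed(1)

# Toy user-by-item rating matrix; 0 means "not rated yet".
R = [
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
]

def factorize(R, k=2, steps=20000, lr=0.01, reg=0.02):
    """Fit R ~= P Q^T on the observed entries only, by stochastic gradient descent."""
    n, m = len(R), len(R[0])
    P = [[random.gauss(0, 0.1) for _ in range(k)] for _ in range(n)]
    Q = [[random.gauss(0, 0.1) for _ in range(k)] for _ in range(m)]
    observed = [(u, i) for u in range(n) for i in range(m) if R[u][i] > 0]
    for _ in range(steps):
        u, i = random.choice(observed)
        pred = sum(P[u][f] * Q[i][f] for f in range(k))
        err = R[u][i] - pred
        for f in range(k):
            pu, qi = P[u][f], Q[i][f]
            P[u][f] += lr * (err * qi - reg * pu)
            Q[i][f] += lr * (err * pu - reg * qi)
    return P, Q

P, Q = factorize(R)
# Fill in a missing rating from the learned factors.
missing = sum(P[0][f] * Q[2][f] for f in range(2))
print(round(missing, 2))
```

The payoff is that unobserved cells -- exactly the items a user has not yet rated -- get predictions from the learned factors, which is what the recommender ranks on.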

Both the item-item similarity and matrix factorization approaches have been eclipsed by other approaches in the last two years. Netflix itself has moved to a machine-learning-based ranking model, and others (such as getJar; see an earlier blog post) have explored neighborhood-based methods.

Would Amazon adopt more sophisticated methods for its recommendations? Given that its business is doing so well with simple methods, this probably will not happen soon.

Jan 3, 2013

Applications of Supervised learning

Machine learning, specifically supervised learning (used here interchangeably with classification), has become so versatile that it can be applied to a wide range of situations. On the surface, binary classification does not sound that interesting -- it only gives a “yes” or “no” answer. But it gains power when you can associate a probability with each “yes” or “no” answer. With probabilities, you can score people based on their likelihood of buying, likelihood of defaulting, or likelihood of churning.

Thus machine learning becomes truly powerful when it is “statistical learning”, where our “yes” or “no” prediction comes with a probability (a number from 0 to 1, which can be scaled to generate a score).
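A minimal sketch of this probability-as-score idea: a one-feature logistic regression trained by gradient descent on made-up churn data. The feature, labels, and decision threshold are all illustrative:

```python
from math import exp

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

# Toy churn data: feature = months since last purchase, label = churned (1) or not (0).
X = [1, 2, 3, 4, 8, 9, 10, 12]
y = [0, 0, 0, 0, 1, 1, 1, 1]

# Fit a one-feature logistic regression by batch gradient descent.
w, b = 0.0, 0.0
lr = 0.05
for _ in range(10000):
    gw = gb = 0.0
    for xi, yi in zip(X, y):
        p = sigmoid(w * xi + b)
        gw += (p - yi) * xi
        gb += (p - yi)
    w -= lr * gw / len(X)
    b -= lr * gb / len(X)

def churn_probability(months):
    """Score a customer: estimated probability of churn, between 0 and 1."""
    return sigmoid(w * months + b)

# Rank customers by score; flag those above an illustrative decision threshold.
for m in (2, 6, 11):
    p = churn_probability(m)
    print(m, round(p, 3), "flag" if p > 0.8 else "ok")
```

The same scoring pattern covers the applications listed below: estimate a probability, then either rank by it (search, ads, prospect lists) or threshold it (fraud, loan approval).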

Here are some applications we can apply binary classification to:

1. Click prediction, which can be applied to (1) search ranking and (2) online ads ranking.
     If you can estimate the probability of a user click, you can order the search results by that probability. The documents or products with a high probability of being clicked will rank higher. Similarly, when deciding which display ads to show to the user, we can use the click probability.

2. Fraud detection
     Our task is deciding whether a transaction is fraudulent. This is simple binary classification: fraud or not? When the probability crosses a threshold (say 0.8), we can classify the transaction as fraud.

3. Selecting sales prospects to call based on their probability of responding.
     This is also a simple binary classification problem: predict whether a prospect will respond or not.

Additional applications of binary classification are:
4. Accepting or rejecting an application for a loan, a credit card, or an insurance claim
5. Customer churn prediction
6. Identifying the most valuable customers
7. Cancer detection
8. Quality control: is a car good to ship or not
9. Product category classification
10. User type classification
11. Sentiment classification in social media: positive or negative?
12. Document classification (applied in legal search, LinkedIn recommendations, and web search)

Thus the data mining field is now dominated by machine learning methodology. Many people equate data mining with machine learning. This is not surprising, given the wide range of applications we see.