Pages

Apr 30, 2013

The future of video mining


IP camera has become inexpensive and ubiquitous. For $149, you can get an IP camera to use at home. This camera saves recording in the cloud, and you never have to worry about buying memory cards. Dropcam offers this amazing service. The camera connects to your wi-fi network and stream real-time video 24 hours a day to the server. You can watch your home remotely from your smart phone or laptop.

Dropcam is a startup based in San Francisco. Founded in 2009, the company has raised 2 rounds of funding, with $12 million of Series B in June 2012. As a young startup, the company has seen rapid user growth. Dropcam CEO Greg Duffy said that Dropcam cameras now upload more video per day than YouTube.

What makes Dropcam unique from other IP camera company is its video mining capability. If you want to review your video of the last 7 days, you don’t have to watching them from the beginning to the end. Dropcam automatic mark the video segment with motion, so that you can jump to those segments right away.

In addition, Dropcam plans to implement face and figure detection so that it can give more intelligent viewing options. Furthermore, the video view software can detect events such as cat running, dinner gathering and so on. The potential of video mining is limitless. Given our limited time to review videos of many hours (or days and weeks) and continuing accumulation of our daily recording, the need for video mining will keep growing.

Apr 29, 2013

Stroke Prediction

Stroke is the third leading cause of death in the United States. It is also the principal cause of serious long-term disability. Stroke risk prediction can contribute significantly to its prevention and early treatment. Numerous medical studies and data analyses have been conducted to identify effective predictors of stroke.

Traditional studies adopted features (risk factors) that are verified  by  clinical trials  or  selected manually by medical experts. For example, one famous study by Lumley and others[1] built a 5-year  stroke prediction model using a set of 16 manually selected features. However, these manually selected features could miss some important indicators. For example,  past studies  have shown that there exist additional risk factors  for stroke such  as  creatinine level,  time  to  walk  15 feet,  and others.

The Framingham Study [2] surveyed a wide range of stroke risk factors including blood pressure, the use of anti-hypertensive therapy, diabetes mellitus, cigarette smoking, prior cardiovascular disease, and atrial fibrillation. With  a large  number of features in current medical  datasets, it  is a  cumbersome task to  identify  and  verify  each  risk  factor  manually.  Machine learning algorithms are capable of identifying features highly related to stroke occurrence efficiently from the huge set of features. By doing so, it can improve the prediction accuracy of stroke risk, in addition to discover new risk factors.

In a study by Khosla and others [3], a machine-learning based predictive model was built on stroke data, and several feature selection methods were investigated. Their model was based on automatically selected features. It outperformed existing stroke model. In addition, they were able to identify risk factors that have not been discovered by traditional approaches. The newly identified factors include:

Total medications
Any ECG  abnormality
Min.  ankle  arm  ratio
Maximal  inflation level
Calculated 100 point score
General  health
Minimental score 35 point

It’s exciting to see machine learning play a more important role in medicine and health management.

Reference:

[1]  T.  Lumley, R. A. Kronmal, M. Cushman, T. A. Manolio, and S. Goldstein. A stroke prediction score in the elderly: Validation and web-based application. Journal of Clinical Epidemiology, 55(2):129–136, February 2002.
[2] P. A. Wolf, R. B. D'Agostino, A. J. Belanger, and W. B. Kannel. Probability of stroke: A risk profile from the Framingham study. Stroke, 22:312{318, March 1991.
[3] Aditya Khosla, Yu Cao, Cliff Chiung-Yu Lin, Hsu-Kuang Chiu, Junling Hu, and Honglak Lee. "An integrated machine learning approach to stroke prediction." In Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 183-192. ACM, 2010.

Apr 26, 2013

History of machine learning

The development of machine learning is an integral part of the development of artificial intelligence. In the early days of AI, people were interested in building machines that mimic human brains. The perceptron model was invented in 1957, and it generated over optimistic view for AI during 1960s. After Marvin Minsky pointed out the limitation of this model in expressing complex functions, researchers stopped pursuing this model for the next decade. .

In 1970s, the machine learning field was dormant, when expert systems became the mainstream approach in AI.  The revival of machine learning came in mid-1980s, when the decision tree model was invented and distributed as software. The model can be viewed by a human and is easy to explain. It is also very versatile and can adapt to widely different problems. It is also in mid 1980s multi-layer neural networks were invented, With enough hidden layers, a neural network can express any function, thus overcoming the limitation of perceptron. We see a revival of the neural network study.


Both decisions trees and neural networks see wide application in financial applications such as loan approval, fraud detection and portfolio management. They are also applied to a wide-range of industrial process and postal office automation (address recognition). 
Machine learning saw rapid growth in 1990s, due to the  invention of World-Wide-Web and large data gathered on the Internet. The fast interaction on the Intern called for more automation and more adaptivity in computer systems. Around 1995, SVM was proposed and have become quickly adopted. SVM packages like libSVM, SVM light make it a popular method to use. 

After year 2000, Logistic regression was rediscovered and re-designed for large scale machine learning problems . In the ten years following 2003, logistic regression has attracted a lot of research work and has become a practical algorithm in many large-scale commercial systems, particularly in large Internet companies.  

We discussed the development of 4 major machine learning methods. There are other method developed in parallel, but see declining use today in the machine field: Naive Bayes, Bayesian networks, and Maximum Entropy classifier (most used in natural language processing). 

In addition to the individual methods, we have seen the invention of ensemble learning, where several classifiers are used together, and its wide adoption today. 

New machine learning methods are still invented each day. For the newest development, please check out the annual ICML (International Conference on Machine Learning) conference. 

Apr 25, 2013

Yelp and Big data


In the Big Data Gurus meetup yesterday, hosted in Samsung R&D center in San Jose, Jimmy Retzlaff from Yelp gave a talk on Big data at Yelp.

By the end of March 2013, Yelp has 36 million user reviews. These reviews cover from restaurants to hair salon, and other local businesses. The number of reviews on Yelp website has grown exponentially in the last few years.

Yelp also sees high traffic now. In January 2013, there are 100 million unique visitors to Yelp. The website records 2 terabytes of log data and another 2 terabytes of derived log every day. While this data size is still small comparing to eBay or LinkedIn, it calls for implementation of big data infrastructure and data mining methods.

Yelp uses MapReduce extensively and builds its infrastructure on Amazon cloud.

Yelp’s log data contain ad display, user clicks and so on. Data mining helps Yelp in designing search system, showing ads, and filter fake reviews. In addition, data mining enables products such as "review highlights", "people who viewed this also viewed...".

Yelp is one example of companies starting to tackle big data and taking advantage of data mining for creating better services.

Apr 24, 2013

Machine Learning for Anti-virus Software


Symantec is the largest anti-virus software vendor. It has 120 million subscribers, who visit 2 billion websites a day and generate 700 billion submissions. Given such a large number of data, it is paramount that an anti-virus software can detect the virus fast and accurately.

Anti-virus software was originally built manually. Security expert review each malware and construct their “signature”. Each computer file is checked against such signatures. Given the rapid change of malware and many variations, there are not enough human experts to generate all the exact signatures. This gives rise to heuristic or generic signatures which can handle more variations of the same file. However, new types of malware are created every day. Thus we need a more adaptive approach to identify malware automatically (without manual effort of creating signatures). This is where machine learning can help.

Computer virus has come a long way. The first virus “creeper” appeared in 1971. Then we have Rabbit or Wabbit. After that came computer worms like “Love Letter” and Nimda. Today computer virus gets much more sophisticated. It evolves much faster and is constantly changing. Virus creation is now funded by organizations and some governments. There is big incentive to steal user financial information or companies’ trade secrets. In addition, malware enables certain governments to conduct spying or potential cyber war on their targets.

Symantec uses about 500 features for their machine learning model. The feature value can be continuous or discrete. Such features include:
How did it come this machine (through browser, email, ..)
When/were
How many other files on this machine?
How many clean files on this machine?
Is file packed or obfuscated? (mutated?)
Does it write, communicate?
How often does it run?
Who runs it?

Researchers at Symantec experiment with SVM, decision tree and linear regression models.

In building a classifier, they are not simply optimizing accuracy or true positive rate. They are also concerned false positive instances where a benign software was classified as malware. Such false positive prediction could have high cost for the users. The balance of true positive vs. false positive leads to using ROC (Receiver Operating Characteristic) curve.

An ROC curve plots the trade-off between true positive rate vs. false positive rate. Each point on the curve corresponds to a cutoff we choose. They use ROC curve to select a target point. Below is an illustration of the tradeoff.

The chart above suggests that when we aim for 90% true positive rate, we will have 20% false positive rate. However, when we only aim for 80% true positive rate, the false positive rate be reduced to 20%. (A better classifier could shift the ROC curve up, so that we achieve high true positive rate for any given false positive rate.)

According their researcher, Symantec has achieved high accuracy rate (the average of True positive and true negative rate) at 95%. Its true positive rate is above 98% and its false positive rate is below 1%.

I am a user of Norton software (by Symantec) and enjoy it. I hope to see more success from Symantec and we are winning the war against malware!

Mar 12, 2013

Basic steps of applying machine learning methods


Deploying a machine learning model typically takes the following five steps:

1. Data collection.
2. Data preprocessing:
    1) Data cleaning;
    2) Data transformation;
    3) Divide data into training and testing sets.
3. Build a model on training data.
4. Evaluate the model on the test data.
5. If the performance is satisfying, deploy to the real system.

This process can be iterative, meaning we can re-start from step 1 again. For example, after a model is deployed, we can collect new data and repeat this process. Let’s look at the details of each step:

1. Data Collection:  
        At this stage, we want to collect all relevant data. For an online business, user click, search queries, and browsing information should be all be captured and saved into the database.
In manufacturing, log data capture machine status and activities. Such data are used to produce maintenance schedules and predict required parts for replacement.

2. Data Preprocessing:
       The data used in Machine Learning describes factors, attributes, or features of an observation.  Simple first steps in looking at the data include finding missing values.  What is the significance of that missing value?   Would replacing a missing data value with the median value for the feature be acceptable? For example, perhaps the person filling out a questionnaire doesn't want to reveal his salary.  This could be because the person has a very low salary or a very high salary.  In this case, perhaps using other features to predict the missing salary data might be appropriate.  One might infer the salary from the person’s zip code.  The fact that the value is missing may be important.  There are machine learning methods that ignore missing values and one of these could be used for this data set.
          2) Data Transformation: 
          In general we work with both numerical and categorical data.  Numerical data consists of actual   numbers, while categorical data have a few discrete values. Examples of categorical data include eye color, species type, marriage status, or gender.  Actually a zip code is categorical.  The zip code is a number but there is no meaning to adding two zip codes.  There may or may not be an order to categorical data.  For instance good, better, best is descriptive categorical data which has an order.

3) After the data has been cleaned and transformed it needs to be split into a training set and a test set.

3. Model Building:  
        This training data set is used to create the model which is used to predict the answers for new cases in which the answer or target is unknown.   For example, Section 1.3 describes how a decision tree is built using the training data set.  Several different modeling techniques have been introduced and will be discussed in detail in future sections.  Various models can be built using the same training data set.

4. Model Evaluation
         Once the model is built with the training data, it is used to predict the targets for the test data.  First the target values are removed from the test data set.  The model is applied to the test data set to predict the target values for the test data.  The predicted value of the target is then compared with the actual target value.  The accuracy of the model is the percentage of correct predictions made.  These accuracies of can be used to compare the different models.  Several other ways to compare model accuracy are discussed in the next section on Performance Evaluation.

5. Model Deployment:
        This is the most important step.  If the speed and accuracy of the model is acceptable, then that model should be deployed in the real system.  The model that is used in production should be made with all the available data. Models improve with the amount of available data used to create the model. The results of the model need to be incorporated in the business strategy.  Data mining models provide valuable information which give companies great advantages.  Obama won the election in part by incorporating the data mining results into his campaign strategy.   The last chapter of this book provides information in how a company can incorporate data mining results into its daily business.

Feb 21, 2013

Data Mining and Neuroscience


Bradley Voytek from University of California at SF gave a talk on data mining and neuroscience yesterday in Mountain View. This is part of Big Data Think Tank meetup.

Voytek said, data mining could play an important role in brain study and treatment. Imagine applying data mining to reduce open-skull surgery from 45 minutes to 2 minutes, imagine a better way to analyze and understand fMRI data. Imagine we can apply data mining to understand aging and brain response, helping us to identify ways to improve cognitive function for the elderly. In addition, not mentioned in this talk, are recent studies on applying data mining (specifically association rule mining) to understand Alzheimer’s disease, identifying associations in brain region change.

Voytek also mentioned the need to understand the network of neural signals. This is an exciting domain as it will ultimately help us to improve human cognition, such as hearing and vision. As reported recently in the news, the implementation of “bionic eyes” depends on the mapping of neurons corresponding to visual processing. Deeper understanding of neurons for this function could help us eradicate blindness completely. Imagine a future when there is no more blindness or deafness. How much human suffering can be eliminated!

Neuroscience is the next frontier of science. Understanding human brain and ultimately human consciousness will resolve age-old question on self and soul. With that understanding, imagine we can preserve consciousness or even human memory. It is not far-fetched to think we can watch another person’s memory like watching a movie, if we can truly understand those neural signals (associated memory retrieval).

It is exciting to see data mining can play an important role of in the advancement of science, particularly in neuroscience.

Feb 11, 2013

Data Mining vs. Machine Learning

Data mining and machine learning used to be two cousins. They have different parents. Now they grow increasingly like each other, almost like twins. Many times people even call data mining by the name Machine learning.

The field of machine learning grew out of the effort of building artificial intelligence. Its major concern is making a machine learn and adapt to new information. The origin of machine learning can be traced back to 1957 when the perceptron model was invented. This is modeled after neurons in human brain. That prompted the development of neural network model, which flourished in late 1980s. From 1980s to 1990s, the decision tree method has become very popular, owing to the efficient package of C4.5. SVM was invented in mid-1990s and it has since been widely used in industry. Logistic regression, an old method in statistics, has seen growing adoption in machine learning after 2001 when the book on statistical learning (The Elements of Statistical Learning) was published.

The field of data mining grows out of knowledge discovery from databases. In 1993, a seminal paper by Rakesh Agrawal and two others proposed an efficient algorithm of mining association rules in large databases. This paper promoted many research papers on discovering frequent patterns and more efficient mining algorithms. The early work of data mining in 1990s was linked to creating better SQL statement and working with databases directly.

Data mining has its strong focus on working with industrial problems and getting practical solutions. Therefore it concerns with not only data size (large data), but also data processing speed (stream data). In addition, personalized recommender systems and network mining are all developed due to business need, outside the machine learning field.  

The two major conferences for data mining are KDD (Knowledge Discovery and Data Mining) and ICDM (International Conference on Data Mining). The two major conferences for machine learning are ICML (International Conference on Machine Learning) and NIPS (Neural Information Processing Systems).  Machine learning researchers attend both types of conferences.However, the data mining conferences have much stronger industrial link.

Data Miners typically have strong foundation in machine learning, but also have a keen interesting in applying it large-scale problems.

Over time, we will see deeper connection between data mining and machine learning. Could they become twins one day? Only time will tell. 

Feb 10, 2013

The future of genome data mining


23andMe is a startup based in Mountain View, California. Founded in 2006, its core business is genome sequencing for individuals, and providing additional information on your ancestry and possible disease risk, which you can access on their website. 

The cost of sequencing a person’s genome used to be prohibitive. However, 23andMe with its deep pocket of venture and personal funding (Co-founder Ann Wojcicki is the wife of Google co-founder Sergey Brin), was able to cut the sequencing price from $999 in 2007 to $399 in 2008, then to $299 until end of 2012. In December 2012, with $50 Million Series D funding, 23andMe slashed the price to $99 per person. Such price is probably below their actual testing cost. Why the price cut? 23andMe states that their goal is to get 1 million people participate.

What is the drive behind the large expansion of the user base? The first potential is disease discovery. With a large population, a disease can be more solidly linked to genome data. Suppose we find gene mutation in 1 diabetes patient, it is not enough to conclude that the mutation caused her diabetes. However, if we find the same gene mutation in 1000 diabetes patients, we can be more confident to draw this conclusion. Ultimately it is getting a large enough size of population sample so that we can uniquely link a segment of the gene mutation or ancestral traits to a disease.

By December 2012 (before the price slash), 23andMe has accumulated 180,000 individual genome profiles [1]. So far, this is the largest dataset any one organization has accumulated on human genomes. Combined with the self-reported health profiles of these customers, studies of disease link to gene patterns can be done more conclusively.    

23andMe has partnered with Genentech to study a range of diseases from Alzhermer’s, to breast cancer, and (mostly recently) Avastin. In addition, the company received a small funding from NIH to study allergy and asthma. Given the large population data of genomes, we could see some exciting discovery.

Data mining will play a big role in these new discoveries. Note only data mining enables pattern discovery in a large data where there are many different diseases and persona traits, it can also create predictive models on disease onset related to person’s genome profile. The feature selection technique from data mining also has worked well on genome study where there are more than 20,000 gene features but only a few data points. Even with 1 million people in the data, the problem of small data points could still exist when only a small of group of people have similar diseases (Thus it is important to get even data from more people, ideally tens of millions or even billions).

The future of genome study is closely linked to data mining. This is an exciting time to be a data miner.

Reference:
[1] 23andMe press release, “23andMe Raises More Than $50 Million in New Financing”, December 11, 2012. http://mediacenter.23andme.com/press-releases/23andme-raises-more-than-50-million-in-new-financing/

Feb 9, 2013

Graph Mining

With the rapid rise of social network (through Facebook), professional network (through LinkedIn), and short-news report network (through Twitter), computer scientists have taken a keen look of network mining. In the terminology of computer science, a network is a graph. A graph is defined as a collection of nodes and edges. Therefore network mining is also called graph mining.

The most successful application of graph mining is web search, where the Internet is modeled as a network, and webpages are nodes. Each webpage is ranked based on their link strength.

Other applications of graph mining include:
1.    Understand molecule structure for drug discovery [1]

2.    Predict the spread of infectious diseases.

Any social network can be modeled as a graph, where nodes are people and edges are their relationship. On Facebook, the edges are "friend". On Twitter, the edges become “followed by”. On LinkedIn, the edges are “connection”. On eBay, the edges can be “sold to”.


How do we make use of the network structure? For marketers, finding out top influencers in a social network can be very useful. These people would influence a lot of people with their opinions.  Marketing can be much more effective by focusing on these influencers.

How do we discover influencers? In Facebook, these are people who have a lot of friends and whose postings get a lot of comments. In Twitter, these people have many followers and whose tweets are retweeted often. Note that it is possible for a top influencer to have a small number of friends (or followers), as long as these friends (or followers) are top influencers. Such a person could be a “king maker”, who directly influences the most powerful/influential politician.

Mining top influencers therefore involves an algorithm like PageRank, which is successfully used to discover top web pages. The essence of PageRank algorithm is recursive calculation of the weights on each link. This can be applied to calculating top influencers, where their influence strength can be recursively based on their followers' influence level.

Given 100 million people in a network, mining top influencers is a computational challenge. Fortunately such computing can be parallelized as each node can by calculated simultaneously.  This is how Google invented MapReduce, and how Hadoop came to be. Essentially, the so-called “Big Data” is about providing parallel computing infrastructure (such as Hadoop). Graph mining pioneered big data computing.

Ref:
[1] Takigawa, Ichigaku, and Hiroshi Mamitsuka. "Graph mining: procedure, application to drug discovery and recent advances." Drug Discovery Today, Volume 18, Issues 1–2, January 2013, Pages 50–57

Jan 10, 2013

Smart TV and Data Mining

Smart TVs are coming. In this year's CES conference, the largest consumer electronics show each year in Las Vegas, every TV manufacturer is bragging about the "smartness" of their TV. All the TVs are now have Internet connectivity, have camera, display content from your smart phone, and play movies from youTube and Netflix. Even more, smart TVs have connection to Facebook, allow you to make Skype calls, and answers you voice command like Siri does on the iPhone.

The most interesting function for me is their capability of making movie recommendations.
 This means every TV has to remember and capture user’s viewing history, and possibly combine with the server’s knowledge about other viewers’ view data. All the information will be sent and stored in  TV maker's server. They will chunk the data, and build movie recommendation engine exactly like Netflix has done. 

This is a big advance for data mining. This is possible only after TV is connect to the Internet and store data in the cloud. Traditional TV does not have the computing power nor enough memory to do data mining. 

Now we are seeing data mining coming to our home. 

Jan 9, 2013

Product Recommendation by Amazon

Amazon is quite secretive about its technology. While researchers from other companies publish many papers on their approaches, we seldom see papers from Amazon. However, we can still infer about its technology by looking at their products in action. In this blog, we take a look at how Amazon makes recommendation to it users.

Users who purchase on Amazon typically get the following two "recommendations" at the bottom of product page: (1)"Frequently bought together" and (2) "Customers who bought this item also bought".

Strictly speaking, showing what are frequent bought together does not require complex recommender systems. This is a simple counting of frequent itemsets, a very fundamental technique taught on the first day of data mining class. This technique is also called Frequent Pattern Mining. The key challenge here is getting all of those "bought together" sets quickly from billions of transactions.

However, Amazon does provide more personalized recommendation, similar to what Netflix does. After you log in, there is a"recommended for you" page where Amazon's recommendation engine is in full action.

How does it Amazon make this recommendation? Based on a 2003 article published on IEEE Internet Computing (Jan/Feb issue), the company uses item-item similarity methods from collaborative filtering. At that time, this was state-of-the-art method, and Amazon was pioneering the field of recommender system.

It seems this same implementation has been in use in Amazon until today. As we can see from the picture to the left. The snapshot was taken in January 2013.  The first product (Girl’s 7-16 Jacket) is recommended because the user purchased a somewhat similar item (Girls 2-6x Princess Jacket). Similar thing is true for the second recommended product. In other words, item-item similarity is a major technology used by Amazon's recommendation engine.  

The field of recommender systems have seen great advance since 2006's Netflix Prize contest. From 2006 until 2009 when the prize was awarded,  many methods have been invented to tackle recommendation problem. Among them, the most widely adopted method today is Matrix Factorization (with SVD as a special implementation). It was shown that this method generated better results than item-item similarity approach. Netflix adopted matrix factorization method after 2007, and has been using it in its production system until today. 

Both item-item similarity and matrix factorization approach have been eclipsed by other approaches in the last 2 years. Netflix itself has moved into a machine-learning based ranking model, and others (such as getJar, see an early blog) have explored neighborhood-based methods.

Would Amazon adopt more sophisticated methods to make its recommendation? Given its business is doing so well with simple methods, this probably will not happen soon.

Jan 3, 2013

Applications of Supervised learning

Machine learning, specifically supervised learning (which is used equivalent to classification), has become so versatile that it can be applied to a wide range of situations. On the surface, binary classification does not sound that interesting -- It only gives a “yes” or “no” answer.  But it gains power when you can associate a probability with each “yes” or “no” answer. With probability, you can score people based on their likelihood of buying, likelihood of defaulting, or likelihood of churning.

Thus machine learning becomes truly powerful when it is “statistical learning”, where our prediction of “yes” or “no” is associated with a probability (a number ranging from 0 to 1, which can be scaled to generate a score). 

Here are some applications we can apply binary classification to:

1. Click prediction, which can be applied to (1) search ranking (2) online ads ranking
     If you can rank the probability of user click, then you can present the search results based on that probability. The documents or products what have high probably of being clicked will ranked higher. Similarly, when we decide which display ads toshow to the user, we can use click probability. 

2. Fraud detection
     Our task is deciding whether a transaction is fraud. This is a simple binary classification: fraud or not? When the probability crosses a threshold (say 0.8), we can classify a transaction as a fraud.

3. Select sales prospects to call based on their probability of responding.
    This is also a simple binary classification problem: Predict whether a prospect will respond to not.

Additional applications of binary classification are: 
4.   Accept or reject an application for  loan or credit card, or an insurance claim
5.    Customer churn
6.    Most valuable customer
7.    Cancer detection
8.    Quality control: Car is good to go out or not
9.    Product category classification
10.    User type classification
11.    Sentiment classification in social media: Positive or negative? 
12.    Document classification (applied in legal search, LinkedIn recommendation, and web search) 

Thus data mining field is now dominated by methodology of machine learning. Many people equate data mining to machine learning. This is not surprising given the wide applications we see.

Dec 30, 2012

Natural Language Processing in patent litigation: Lex Machina

It is always exciting to see Natural Language Processing being put in practical use. In a previous blog, I mentioned about a startup using NLP for business products: Radius Intelligence (extracting business names and locations).

Today I want to introduce a 2nd company in this domain: Lex Machina. This one is much closer to home (I mean, where I live). It is based in Palo Alto, founded by a couple of Stanford professors and students, one from Law school and one from computer science department. Their product is patent litigation analysis, completely automated.

Given a lot of patent lawsuits around this country, the company creates a daily index and searchable database on major lawsuits. The value of tracing patent lawsuit is apparent to many large companies, as we can tell from Apple and Samsung’s fierce fight. Back in late 1990s, eBay lost a patent litigation on an obscure auction algorithm written by unknown person, even after spending a large sum on lawyers and expert witnesses. Not to mention Google acquitted Motorola mainly for its large patent portfolio.

Lex Machina has a web crawler that crawls major court case websites and bring back the document and index them. I can see 2 major techniques from NLP used here: (1) Named Entity Recognition (2) Text classification (with supervised learning).

Beyond these techniques, this company also converts the documents by optical character recognition (OCR) to searchable text and stores each one as a PDF file. In addition, the crawl is done on a few major litigation websites, making the problem tractable.

I look forward to seeing the succcess of Lex Machina. 

Dec 20, 2012

How 'Big Data" became the word of the year

In today's NPR Fresh Air program, linguist Geoffrey Nunberg chose "Big Data" as the word of the year (listen to this podcast).  I personally agree.

Data mining was a big part of Obama campaign's victory secret this year, but also the word is also mentioned widely in many publications.

Back in Feb 2010, Economist ran a cover story on The Data Deluge. It essentially said "Data, data everywhere".

But the term "big data" started to appear only on this year. In New York Times' article on February 11, 2012,  "The Age of Big Data", we start to see the term get used. New York Times then had a follow-up article in August titled "How Big Data Became So Big".

Harvard Business Review had its October 2012 issue devoted to "Big Data", discussing its impact on management revolution to career choice as data scientists.

On November 18,  2 weeks after the election, Wall Street Journal  had an article on Obama's 'Big Data' Victory.  

Indeed, "Big Data" has become a buzz word, and I vote for it to be the Word of the Year too!


(By the way, I like Numberg's comment: "In the old days nobody knows you are a dog on the Internet. Now everybody knows what dog food you buy.")

Dec 18, 2012

User Click Prediction

Click prediction has been well studied in online advertisement and sponsored search domains. The probability of clicking on a link can be modeled with a statistical classifier. Such a classifier assigns probabilities to two possible outcomes: click and no click. Thus the problem can be transformed into training a good binary classifier that predicts the likelihood of clicks, and the training can be done
with existing query log data.

There are two ways of improving a classifier: One is selecting the most appropriate machine learning algorithm, the other is improving features for the classi er.

People have experimented with different machine learning algorithms for click prediction, including ranking SVM, binary SVM, probit regression, decision trees, and gradient boosted decision trees. We use logistic regression more often. This algorithm is fast and performs as well as SVM. In addition, it is  less prone to over fitting.

The second way to improve a classi fier's performance is using good features. For this purpose, people tried various features for click prediction model. The most important feature is historical Click Through Rate (CTR) when such data exist. However, on many websites, new items or documents are introduced daily. They don't have click history. In this case, many other features can be used.

Two major types of features are item features and user features. Item features include information of a document or a product. For example, an item feature for a product can be price and posting date etc. User features include both demographic information and behavior information. Behavior information is user browsing behavior such as dwell time and query reformulation. In addition, we can use text features from user query.

Dec 17, 2012

BlackLotus and Product Assortment

It is reported in TechCrunch today that Home Depot acquired startup BlackLotus. BlackLotus applies data mining to two major problems facing retail stores: Pricing and Product Assortment. The Pricing solution got a lot of mention in this article. I want dive deeper on the second solution provided by BlackLotus: Product assortment.

Since the early days of data mining (which started in 1993, with the publication of Rakesh Agarsal's seminal paper on frequent pattern mining), researchers has been paying attention to product selection problem. The famous "diaper and beer" story illustrates how frequent pattern mining discovers young fathers who shop for diaper also shop for beer on Friday nights. The purchase pattern of users on an assortment of products give retailers hint on what to display together. It also gives retailers a tool to decide what to replenish when products are sold out.

Each retailer faces the problem of limited shelf space. Take Home Depot as an example. If they decide to replenish oriental rug, but find that this kind of rug is bought together with long dinning table. Then they should consider increasing the stock of those long dinning tables. Given the limited shelf space, they may have to decrease the stock of contemporary rugs if that rug does not sell well. Thus the decision has to be made between two product assortments: {oriental rug, long dinning table} and {contemporary rug}. This decision considers the frequency of purchase of each assortment, and the profit each sale brings.

The data mining approach to Frequent Pattern Mining gives us a fast and efficient way to calculate frequency of any product. The transaction table is recorded as in Table 1, where each transaction correspond to each row, and the columns are the products. One person can have multiple rows if that person comes back to shop again. In this table, we assume that are total 9 million purchases (in a month) in Home Depot.
Transaction ID
Oriental Rug
Contemporary
Rug
Dinning table
  1
1
0
  ...
1
  2
0
1
  ...
0
  3
1
0
  ...
0
  ...
 ...
  ...
  ...

  9,000,000
1
0
  ...
1
                             
                                    Table 1. Transaction table

The data mining field has invented 2 major algorithms to calculate the purchase frequency of any product set, or itemset (a set of product items): Apriori and FP-tree (Frequent-pattern tree). The first is a breadth first search method, and the second is essentially a depth-first search.

We will talk about the details of these algorithms in the future if time allows. 

Dec 16, 2012

Major Data Mining Conferences in 2013

Here is a list of major data mining conferences in 2013.The one I highly recommend is KDD, as it represents the best Data Mining research from both industry and academic.The site for 2013 is not yet complete, but you can take a loot at KDD-2012 to get a flavor of its programs.

What's interesting is the general data mining conferences are all in the US for this year (2013), but sub-area conferences are in interesting locations like Brazil, France, or Bulgaria.

Academic+industry
  • SDM (SIAM Conference on Data Mining),  May 2-4, Austin, Texas
  • KDD (Knowledge Discover and Data Mining), Aug 11-14, Chicago, Illinois
  • ICDM (International conference on Data Mining), Dec 8-11, Dallas, Texas

 Industry

Academic
  • ICML (machine learning), June 16-21, Atlanta, Georgia
  • AAAI (Artificial Intelligence), July 14-18, Bellevue, Washington

 Specialized area
  • WWW  (web data, text mining), May 13-17, Rio de Janeiro, Brazil
  • SIGIR  (text mining), July 28-31, Dublin, Ireland
  • SIGGraph  (Image/video mining), July 21-25, Anaheim, California
  • ACL (Natural Language processing), Aug 4-9, Sofia, Bulgaria
  • Interspeech (Speech mining), Aug 25-29, Lyon, France
  • Recsys (Recommender system), Oct 12-16, Hong Kong

Dec 15, 2012

Clustering – An unsupervised learning method

Typical machine learning has well-defined outcomes: A review is either positive or negative, a mammogram scan either shows breast cancer or not. With unsupervised learning, we don’t know the outcome. What we can do is simply grouping similar items together, and form clusters. This is so-called clustering method. 

Clustering does not require training or human labeling on training data, thus it is a very convenient tool to use when we have a large number of new data, which may be fundamentally different from old data. This happens on the Internet domain, where new documents appear daily. It also happens with E-commerce domain, where new products appear weekly or monthly.

Clustering gives us freedom to form groups without prior knowledge. Such freedom also presents problems: How many clusters do we form? How do we define similarity? Once clusters are formed, how do we verify their correctness?

We can arbitrarily decide on the number of clusters beforehand. Let’s call this number k. To start clustering, we randomly pick k data points and assume they are the centers of k clusters (called centroid). Then we calculate the distance of other data points to each of these centers. Each of these data points are assigned to the neighborhood of the centroid that it is closest to. Once all data points are assigned, we will have k clusters. In round 2, we re-calculate the centroid of each cluster (Mathematically it is the mean of all the data points in a cluster), and re-assign all data points to each of these clusters. In round 3, we will repeat the process. The process will end when all clusters are the same as the last round. 

Let's look at a simple example. Suppose we have 6 people's income as the following: 1, 2, 100, 101, 1000, 1050. Suppose we believe there are 3 possible clusters in this data (k=3). To start, we arbitrarily pick 3 data points as the centers of clusters. Let these initial k centroids be: 1, 100, 101.  The remaining 3 points are then assigned to the clusters of {1}, {100} and {1050}. Based on the shortest distance, we end up with {1, 2}, {100}, {101, 1000, 1050}. In Round 2, our new centroids are 1.5, 100 and 717. Based on these centroids, our new clusters are {1, 2}, {100, 101}, {1000, 1050}. In round 3, the new centroids are 1,5, 100.5, 1025. The new clusters are {1, 2}, {100, 101}, {1000, 1050}. They are the same as the last round, and we have found our clusters. 

The about clustering process is called K-means clustering. It is the most popular method for doing clustering. Another popular method is called Expectation Maximization (EM). We may discuss it in the future blog, if there is opportunity (when no other interesting topic pops up). 

Dec 14, 2012

10 Startups providing cutting-edge Data Mining products

Data mining has matured to the point that new business can be created around it. In bay area, exciting startups have popped up to provide cutting-edge solution on audience targeting, email security, and mobile applications.

1. GetJar   
     Mobile app distributor and virtual currency for user to enjoy premium app, 
     Technique: Recommender system

2. Ooyala
      Provide Video Analytics and personalized video Ad targeting, understand viewer behavior
     Technique: Reinforcement learning, recommender system

3. Rocket Fuel
     Help clients to create targeted advertising campaign
     Technique: Machine learning, big data

4.    Turn
      Targeted advertising, Mining data for digital advertising
     Technique: Machine learning, big data

5.    BlueKai
      Online User tracking and analytics for marketing                             
      Technique: Clustering

6.    Quantcast
      Real-time audience targeting, online bidding   
      Technique: Machine learning

7.   Tellapart      
      Real time bidding on behavior of clients, audience targeting on display ads, with scoring on customers likelihood of viewing
      Technique: Machine learning, Regression

8.    Reputation.com   
       Web data mining on all information about a person or company, create targeted ranking solution to represent your reputation               
       Technique: Predictive modeling

9. Radius Intelligence
      Web data mining for lead generation of local sales (Mining location, business revenue, etc),             
      Technique: Natural language processing  

10. ProofPoint
     Data security as a service, focusing on email security 
     Technique: Machine learning, big data

Dec 13, 2012

Machine Learning poster session at Stanford University

Today I attended the Stanford machine learning class (CS 229) poster session. The class was taught by Prof. Andrew Ng and has attracted 575 students, with many from companies in the local area. Total 250 projects were presented in this session.

As I walked around, I was amazed at the breadth and depth of research projects presented, many of which has immediate practical applications. I came across posters from company employees such as Walmart labs, eBay and Apple. They apply techniques learned from this class to some interesting real-world problems.

Here are some highlights:
1. Detect web traffic anomaly on an E-commerce site, and verify the anomaly with Twitter data.
 This connects mining user sentiment on Twitter (chat about this company's problem), to tracking the web performance of a company's website. The sentiment is detected by using the AFINN dataset, which contains words with scores indicating their positive and negative sentiment.

2. Recommend TV shows to facebook users based on their profile and likes (if they are willing to share their info). The students experimented with various recommender system techniques such as SVD, item-item similarity and user-user similarity. They also tried other methods such as bipartite graph, and Naive Bayes to predict user likes. 
     Other recommender systems in this poster session:
  • Beer recommendation engine
  • Best Buy Recommendation System.
 3. Design a smarter way to model stock price change, with a modified MCMC (Markov Chain Monte Carlo) algorithm. Stock prices looks highly volatile. Traditional approach comes from particle modeling in physics. Recently MCMC approach generates better results. This project is a further improvement on MCMC modeling. 

4. Separate singing voice from background music, with the PLCA (Probabilistic Latent Component Analysis) approach. It is a large matrix operation. This approach was presented in NIPS 2006 conference, and has been shown to be much more powerful than traditional approach.

There are many many more. Poster photos are listed at  Machine Learning poster session at Google+ event.

I want to end with a fun project: using an Xbox camera to learn users' hand motions and what they type. It seems there are endless things we can apply machine learning to.  

Dec 12, 2012

Feature selection – An essential step of machine learning

A machine learning model creates prediction from input variables (features) to the output. The output can be binary, such as whether a car is defect or not; it can also be numerical, such as the inflation rate of next year. A good machine learning model uses the minimum number of features to get the most accurate prediction. In many cases, a smaller number of features can gives better performance than a larger number of them. This is because additional features add noise and leads to overfitting. Thus the model does not do well with new test cases.

There is another reason for feature selection: computational speed. A model with a large number of features takes a long time to build, and take too long to apply in real time.

The third reason is sample size vs. the number of features. If the number of features is large, we will overfit the model and cannot create good prediction. In gene expression study, there are more than 100,000 features, but only a few hundred data points.  In order to increase the precision of the model, we need to increase sample size. For supervised learning, this means more human labeling on additional data. Sometime this is not possible.

How do we go about selecting the best and smallest number of features? If we try all subset of n features, it will take times of computation. This is apparently infeasible.

Feature selection has been studied in machine learning field for the last 2 decades.The following 4 methods have been proposed:
  1. Forward search: This method adds feature 1 by 1. It's a greedy algorithm that does not always lead to the best solution
  2. Backward search: This method starts with all features, and substract 1 by 1, until the performance no longer improves. It is also a greedy algorithm.
  3. Adaptive Lasso
  4. L1-regularized Logistic Regression 
I am in favor of the last method, L1-regularized LR, as it consistently gives us the best performance among all feature selection methods. Hopefully in future blogs, I can expand on this discussion  a little more.

Dec 11, 2012

Mining Traffic Video

In traffic intersections, video cameras constantly record the information of traffic flow, vehicle information and activities. How do we mine such video quickly (in real time) and capture the information we need?

To mining video data, first we need to understand how it is different from other types of data. Video is essentially a sequence of image frames. Video data of a normal camera has about 30 image frames per second. Each image frame is identical to a still image.

The first step of video mining is segmentation. This is similar to the first step in image mining. The goal of segmentation is separating objects (such as vehicles) from their background. In the following pictures, the left is the full picture, and the right is the background.
 One way to achieve segmentation is through background subtraction, where a reference (background) frame is generated. Any image is subtracted by this reference framework, which results in the remaining objects.

The second step of video mining is object tracking. For each object, a bounding box is calculated and the centroid of this box is located. Then the position of centroid is tracked from frame to frame. The change of such centroid gives us the movement of the object. 

The ability of tracking object allows us to estimate traffic flow through a interaction between, say 9 am to 10 am.

For more information on how to mine traffic data, please see this paper:
Chen, Shu-Ching, et al. "Multimedia data mining for traffic video sequences." Proc. of International Workshop on Multimedia Data Mining (MDM/KDD’2001). 2001.

Dec 10, 2012

Introduction to N-grams

In analyzing text documents, we can count the frequency of words appearing together in a fixed order. For example, we can count 2-word phrases like “baseball game”, “baseball card”, and “baseball player” etc. Similarly we can count 3-word phrase like “baseball game online”, “baseball game today”, and “baseball game reports”. This approach of counting all adjacent n words in a document is called n-gram approach. The 1-grams are all individual words in the documents, 2-grams are the adjacent 2-word phrases, and so on.

For example, in the following sentence:
A Major League Baseball game was held in Salt Lake City 40 years ago.
1-grams are: {a, Major, League, Baseball, game, was, held, in, Salt, Lake, City, 40, years, ago}
2-grams are: {a Major, Major League, League Baseball, Baseball game, game was, was held, held in, ... }
3-grams are: {a Major League, Major League Baseball, League Baseball game, Baseball game was,...}
Similarly we can count 4-grams, 5-grams and so on. 

Google has an n-gram viewer that counts n-grams in all Google books with certain periods. The chart in the beginning of this blog shows the count for 2-grams “baseball game”, “baseball card”, and “baseball player” in Google books between 1950 and 2008. 

The count of n-grams can be used to predict people’s next search keywords. Suppose we have a collection of all search keywords in the last 2 years. In this collection, the 2-gram “baseball express” has the highest count, followed by “baseball cards”. The 2-gram with higher count is more likely to appear with the 2-grams with lower count. Thus after a user types “baseball”, the next word is more like to be “express” than “cards” in the online search scenario. Here is a snapshot of Yahoo! Search suggestion.
N-grams are also used in spell checking, where recommended words are determined by the surrounding words. They are used in speech recognition, where word identification is based on words uttered before.


Dec 7, 2012

The basics of Image Mining

In online stores, listings of products normally come with pictures. For people who purchase dresses, the difference in images would make a lot difference. A good product search engine would be able to match search keywords with listing text and images combined.

How do we process images in search engine? Here are the steps:
First, we segment the image into the foreground and background. Below are examples of segmentation done by a computer algorithm, with the top row as original picture and bottom row the segmented images with the background whitened out.
The segmentation algorithm is based on GrabCut, which is essentially creating a rectangular bounding box around the foreground object, and then removing all the pixels outside the bounding box.

After the segmentation is done, we extract the photographic and object feature from an image.
Photographic features include:
  • Aspect Ratio (image height divided by its width)
  • Brightness 
  • Contrast 
  • Colorfulness
Object features include:
  • Color histogram
  • Texture histogram
  • Shape features
An image will be represented as a vector of these features.
In product search engine, the image vector of a listing is combined with text vector to represent the listing.