Dec 30, 2012

Natural Language Processing in patent litigation: Lex Machina

It is always exciting to see Natural Language Processing being put in practical use. In a previous blog, I mentioned about a startup using NLP for business products: Radius Intelligence (extracting business names and locations).

Today I want to introduce a 2nd company in this domain: Lex Machina. This one is much closer to home (literrally). It is based in Palo Alto, founded by a couple of Stanford professors and students, one from Law school and one from computer science department. Their product is patent litigation analysis, completely automated.

Given a lot of patent lawsuits around this country, the company creates a daily index and searchable database on major lawsuits. The value of tracing patent lawsuit is apparent to many large companies, as we can tell from Apple and Samsung’s fierce fight. Back in late 1990s, eBay lost a patent litigation on an obscure auction algorithm written by unknown person, even after spending a large sum on lawyers and expert witnesses. Not to mention Google acquired Motorola mainly for its large patent portfolio.

Lex Machina has a web crawler that crawls major court case websites and bring back the document and index them. I can see 2 major techniques from NLP used here: (1) Named Entity Recognition (2) Text classification (with supervised learning).

Beyond these techniques, this company also converts the documents by optical character recognition (OCR) to searchable text and stores each one as a PDF file. In addition, the crawl is done on a few major litigation websites, making the problem tractable.

I look forward to seeing the succcess of Lex Machina. 

Dec 20, 2012

How 'Big Data" became the word of the year

In today's NPR Fresh Air program, linguist Geoffrey Nunberg chose "Big Data" as the word of the year (listen to this podcast).  I personally agree.

Data mining was a big part of Obama campaign's victory secret this year, but also the word is also mentioned widely in many publications.

Back in Feb 2010, Economist ran a cover story on The Data Deluge. It essentially said "Data, data everywhere".

But the term "big data" started to appear only on this year. In New York Times' article on February 11, 2012,  "The Age of Big Data", we start to see the term get used. New York Times then had a follow-up article in August titled "How Big Data Became So Big".

Harvard Business Review had its October 2012 issue devoted to "Big Data", discussing its impact on management revolution to career choice as data scientists.

On November 18,  2 weeks after the election, Wall Street Journal  had an article on Obama's 'Big Data' Victory.  

Indeed, "Big Data" has become a buzz word, and I vote for it to be the Word of the Year too!

(By the way, I like Numberg's comment: "In the old days nobody knows you are a dog on the Internet. Now everybody knows what dog food you buy.")

Dec 18, 2012

User Click Prediction

Click prediction has been well studied in online advertisement and sponsored search domains. The probability of clicking on a link can be modeled with a statistical classifier. Such a classifier assigns probabilities to two possible outcomes: click and no click. Thus the problem can be transformed into training a good binary classifier that predicts the likelihood of clicks, and the training can be done
with existing query log data.

There are two ways of improving a classifier: One is selecting the most appropriate machine learning algorithm, the other is improving features for the classi er.

People have experimented with different machine learning algorithms for click prediction, including ranking SVM, binary SVM, probit regression, decision trees, and gradient boosted decision trees. We use logistic regression more often. This algorithm is fast and performs as well as SVM. In addition, it is  less prone to over fitting.

The second way to improve a classi fier's performance is using good features. For this purpose, people tried various features for click prediction model. The most important feature is historical Click Through Rate (CTR) when such data exist. However, on many websites, new items or documents are introduced daily. They don't have click history. In this case, many other features can be used.

Two major types of features are item features and user features. Item features include information of a document or a product. For example, an item feature for a product can be price and posting date etc. User features include both demographic information and behavior information. Behavior information is user browsing behavior such as dwell time and query reformulation. In addition, we can use text features from user query.

Dec 17, 2012

BlackLotus and Product Assortment

It is reported in TechCrunch today that Home Depot acquired startup BlackLotus. BlackLotus applies data mining to two major problems facing retail stores: Pricing and Product Assortment. The Pricing solution got a lot of mention in this article. I want dive deeper on the second solution provided by BlackLotus: Product assortment.

Since the early days of data mining (which started in 1993, with the publication of Rakesh Agarsal's seminal paper on frequent pattern mining), researchers has been paying attention to product selection problem. The famous "diaper and beer" story illustrates how frequent pattern mining discovers young fathers who shop for diaper also shop for beer on Friday nights. The purchase pattern of users on an assortment of products give retailers hint on what to display together. It also gives retailers a tool to decide what to replenish when products are sold out.

Each retailer faces the problem of limited shelf space. Take Home Depot as an example. If they decide to replenish oriental rug, but find that this kind of rug is bought together with long dinning table. Then they should consider increasing the stock of those long dinning tables. Given the limited shelf space, they may have to decrease the stock of contemporary rugs if that rug does not sell well. Thus the decision has to be made between two product assortments: {oriental rug, long dinning table} and {contemporary rug}. This decision considers the frequency of purchase of each assortment, and the profit each sale brings.

The data mining approach to Frequent Pattern Mining gives us a fast and efficient way to calculate frequency of any product. The transaction table is recorded as in Table 1, where each transaction correspond to each row, and the columns are the products. One person can have multiple rows if that person comes back to shop again. In this table, we assume that are total 9 million purchases (in a month) in Home Depot.
Transaction ID
Oriental Rug
Dinning table

                                    Table 1. Transaction table

The data mining field has invented 2 major algorithms to calculate the purchase frequency of any product set, or itemset (a set of product items): Apriori and FP-tree (Frequent-pattern tree). The first is a breadth first search method, and the second is essentially a depth-first search.

We will talk about the details of these algorithms in the future if time allows. 

Dec 16, 2012

Major Data Mining Conferences in 2013

Here is a list of major data mining conferences in 2013.The one I highly recommend is KDD, as it represents the best Data Mining research from both industry and academic.The site for 2013 is not yet complete, but you can take a loot at KDD-2012 to get a flavor of its programs.

What's interesting is the general data mining conferences are all in the US for this year (2013), but sub-area conferences are in interesting locations like Brazil, France, or Bulgaria.

  • SDM (SIAM Conference on Data Mining),  May 2-4, Austin, Texas
  • KDD (Knowledge Discover and Data Mining), Aug 11-14, Chicago, Illinois
  • ICDM (International conference on Data Mining), Dec 8-11, Dallas, Texas


  • ICML (machine learning), June 16-21, Atlanta, Georgia
  • AAAI (Artificial Intelligence), July 14-18, Bellevue, Washington

 Specialized area
  • WWW  (web data, text mining), May 13-17, Rio de Janeiro, Brazil
  • SIGIR  (text mining), July 28-31, Dublin, Ireland
  • SIGGraph  (Image/video mining), July 21-25, Anaheim, California
  • ACL (Natural Language processing), Aug 4-9, Sofia, Bulgaria
  • Interspeech (Speech mining), Aug 25-29, Lyon, France
  • Recsys (Recommender system), Oct 12-16, Hong Kong

Dec 15, 2012

Clustering – An unsupervised learning method

Typical machine learning has well-defined outcomes: A review is either positive or negative, a mammogram scan either shows breast cancer or not. With unsupervised learning, we don’t know the outcome. What we can do is simply grouping similar items together, and form clusters. This is so-called clustering method. 

Clustering does not require training or human labeling on training data, thus it is a very convenient tool to use when we have a large number of new data, which may be fundamentally different from old data. This happens on the Internet domain, where new documents appear daily. It also happens with E-commerce domain, where new products appear weekly or monthly.

Clustering gives us freedom to form groups without prior knowledge. Such freedom also presents problems: How many clusters do we form? How do we define similarity? Once clusters are formed, how do we verify their correctness?

We can arbitrarily decide on the number of clusters beforehand. Let’s call this number k. To start clustering, we randomly pick k data points and assume they are the centers of k clusters (called centroid). Then we calculate the distance of other data points to each of these centers. Each of these data points are assigned to the neighborhood of the centroid that it is closest to. Once all data points are assigned, we will have k clusters. In round 2, we re-calculate the centroid of each cluster (Mathematically it is the mean of all the data points in a cluster), and re-assign all data points to each of these clusters. In round 3, we will repeat the process. The process will end when all clusters are the same as the last round. 

Let's look at a simple example. Suppose we have 6 people's income as the following: 1, 2, 100, 101, 1000, 1050. Suppose we believe there are 3 possible clusters in this data (k=3). To start, we arbitrarily pick 3 data points as the centers of clusters. Let these initial k centroids be: 1, 100, 101.  The remaining 3 points are then assigned to the clusters of {1}, {100} and {1050}. Based on the shortest distance, we end up with {1, 2}, {100}, {101, 1000, 1050}. In Round 2, our new centroids are 1.5, 100 and 717. Based on these centroids, our new clusters are {1, 2}, {100, 101}, {1000, 1050}. In round 3, the new centroids are 1,5, 100.5, 1025. The new clusters are {1, 2}, {100, 101}, {1000, 1050}. They are the same as the last round, and we have found our clusters. 

The about clustering process is called K-means clustering. It is the most popular method for doing clustering. Another popular method is called Expectation Maximization (EM). We may discuss it in the future blog, if there is opportunity (when no other interesting topic pops up). 

Dec 14, 2012

10 Startups providing cutting-edge Data Mining products

Data mining has matured to the point that new business can be created around it. In bay area, exciting startups have popped up to provide cutting-edge solution on audience targeting, email security, and mobile applications.

1. GetJar   
     Mobile app distributor and virtual currency for user to enjoy premium app, 
     Technique: Recommender system

2. Ooyala
      Provide Video Analytics and personalized video Ad targeting, understand viewer behavior
     Technique: Reinforcement learning, recommender system

3. Rocket Fuel
     Help clients to create targeted advertising campaign
     Technique: Machine learning, big data

4.    Turn
      Targeted advertising, Mining data for digital advertising
     Technique: Machine learning, big data

5.    BlueKai
      Online User tracking and analytics for marketing                             
      Technique: Clustering

6.    Quantcast
      Real-time audience targeting, online bidding   
      Technique: Machine learning

7.   Tellapart      
      Real time bidding on behavior of clients, audience targeting on display ads, with scoring on customers likelihood of viewing
      Technique: Machine learning, Regression

       Web data mining on all information about a person or company, create targeted ranking solution to represent your reputation               
       Technique: Predictive modeling

9. Radius Intelligence
      Web data mining for lead generation of local sales (Mining location, business revenue, etc),             
      Technique: Natural language processing  

10. ProofPoint
     Data security as a service, focusing on email security 
     Technique: Machine learning, big data

Dec 13, 2012

Machine Learning poster session at Stanford University

Today I attended the Stanford machine learning class (CS 229) poster session. The class was taught by Prof. Andrew Ng and has attracted 575 students, with many from companies in the local area. Total 250 projects were presented in this session.

As I walked around, I was amazed at the breadth and depth of research projects presented, many of which has immediate practical applications. I came across posters from company employees such as Walmart labs, eBay and Apple. They apply techniques learned from this class to some interesting real-world problems.

Here are some highlights:
1. Detect web traffic anomaly on an E-commerce site, and verify the anomaly with Twitter data.
 This connects mining user sentiment on Twitter (chat about this company's problem), to tracking the web performance of a company's website. The sentiment is detected by using the AFINN dataset, which contains words with scores indicating their positive and negative sentiment.

2. Recommend TV shows to facebook users based on their profile and likes (if they are willing to share their info). The students experimented with various recommender system techniques such as SVD, item-item similarity and user-user similarity. They also tried other methods such as bipartite graph, and Naive Bayes to predict user likes. 
     Other recommender systems in this poster session:
  • Beer recommendation engine
  • Best Buy Recommendation System.
 3. Design a smarter way to model stock price change, with a modified MCMC (Markov Chain Monte Carlo) algorithm. Stock prices looks highly volatile. Traditional approach comes from particle modeling in physics. Recently MCMC approach generates better results. This project is a further improvement on MCMC modeling. 

4. Separate singing voice from background music, with the PLCA (Probabilistic Latent Component Analysis) approach. It is a large matrix operation. This approach was presented in NIPS 2006 conference, and has been shown to be much more powerful than traditional approach.

There are many many more. Poster photos are listed at  Machine Learning poster session at Google+ event.

I want to end with a fun project: using an Xbox camera to learn users' hand motions and what they type. It seems there are endless things we can apply machine learning to.  

Dec 12, 2012

Feature selection – An essential step of machine learning

A machine learning model creates prediction from input variables (features) to the output. The output can be binary, such as whether a car is defect or not; it can also be numerical, such as the inflation rate of next year. A good machine learning model uses the minimum number of features to get the most accurate prediction. In many cases, a smaller number of features can gives better performance than a larger number of them. This is because additional features add noise and leads to overfitting. Thus the model does not do well with new test cases.

There is another reason for feature selection: computational speed. A model with a large number of features takes a long time to build, and take too long to apply in real time.

The third reason is sample size vs. the number of features. If the number of features is large, we will overfit the model and cannot create good prediction. In gene expression study, there are more than 100,000 features, but only a few hundred data points.  In order to increase the precision of the model, we need to increase sample size. For supervised learning, this means more human labeling on additional data. Sometime this is not possible.

How do we go about selecting the best and smallest number of features? If we try all subset of n features, it will take times of computation. This is apparently infeasible.

Feature selection has been studied in machine learning field for the last 2 decades.The following 4 methods have been proposed:
  1. Forward search: This method adds feature 1 by 1. It's a greedy algorithm that does not always lead to the best solution
  2. Backward search: This method starts with all features, and substract 1 by 1, until the performance no longer improves. It is also a greedy algorithm.
  3. Adaptive Lasso
  4. L1-regularized Logistic Regression 
I am in favor of the last method, L1-regularized LR, as it consistently gives us the best performance among all feature selection methods. Hopefully in future blogs, I can expand on this discussion  a little more.

Dec 11, 2012

Mining Traffic Video

In traffic intersections, video cameras constantly record the information of traffic flow, vehicle information and activities. How do we mine such video quickly (in real time) and capture the information we need?

To mining video data, first we need to understand how it is different from other types of data. Video is essentially a sequence of image frames. Video data of a normal camera has about 30 image frames per second. Each image frame is identical to a still image.

The first step of video mining is segmentation. This is similar to the first step in image mining. The goal of segmentation is separating objects (such as vehicles) from their background. In the following pictures, the left is the full picture, and the right is the background.
 One way to achieve segmentation is through background subtraction, where a reference (background) frame is generated. Any image is subtracted by this reference framework, which results in the remaining objects.

The second step of video mining is object tracking. For each object, a bounding box is calculated and the centroid of this box is located. Then the position of centroid is tracked from frame to frame. The change of such centroid gives us the movement of the object. 

The ability of tracking object allows us to estimate traffic flow through a interaction between, say 9 am to 10 am.

For more information on how to mine traffic data, please see this paper:
Chen, Shu-Ching, et al. "Multimedia data mining for traffic video sequences." Proc. of International Workshop on Multimedia Data Mining (MDM/KDD’2001). 2001.

Dec 10, 2012

Introduction to N-grams

In analyzing text documents, we can count the frequency of words appearing together in a fixed order. For example, we can count 2-word phrases like “baseball game”, “baseball card”, and “baseball player” etc. Similarly we can count 3-word phrase like “baseball game online”, “baseball game today”, and “baseball game reports”. This approach of counting all adjacent n words in a document is called n-gram approach. The 1-grams are all individual words in the documents, 2-grams are the adjacent 2-word phrases, and so on.

For example, in the following sentence:
A Major League Baseball game was held in Salt Lake City 40 years ago.
1-grams are: {a, Major, League, Baseball, game, was, held, in, Salt, Lake, City, 40, years, ago}
2-grams are: {a Major, Major League, League Baseball, Baseball game, game was, was held, held in, ... }
3-grams are: {a Major League, Major League Baseball, League Baseball game, Baseball game was,...}
Similarly we can count 4-grams, 5-grams and so on. 

Google has an n-gram viewer that counts n-grams in all Google books with certain periods. The chart in the beginning of this blog shows the count for 2-grams “baseball game”, “baseball card”, and “baseball player” in Google books between 1950 and 2008. 

The count of n-grams can be used to predict people’s next search keywords. Suppose we have a collection of all search keywords in the last 2 years. In this collection, the 2-gram “baseball express” has the highest count, followed by “baseball cards”. The 2-gram with higher count is more likely to appear with the 2-grams with lower count. Thus after a user types “baseball”, the next word is more like to be “express” than “cards” in the online search scenario. Here is a snapshot of Yahoo! Search suggestion.
N-grams are also used in spell checking, where recommended words are determined by the surrounding words. They are used in speech recognition, where word identification is based on words uttered before.

Dec 7, 2012

The basics of Image Mining

In online stores, listings of products normally come with pictures. For people who purchase dresses, the difference in images would make a lot difference. A good product search engine would be able to match search keywords with listing text and images combined.

How do we process images in search engine? Here are the steps:
First, we segment the image into the foreground and background. Below are examples of segmentation done by a computer algorithm, with the top row as original picture and bottom row the segmented images with the background whitened out.
The segmentation algorithm is based on GrabCut, which is essentially creating a rectangular bounding box around the foreground object, and then removing all the pixels outside the bounding box.

After the segmentation is done, we extract the photographic and object feature from an image.
Photographic features include:
  • Aspect Ratio (image height divided by its width)
  • Brightness 
  • Contrast 
  • Colorfulness
Object features include:
  • Color histogram
  • Texture histogram
  • Shape features
An image will be represented as a vector of these features.
In product search engine, the image vector of a listing is combined with text vector to represent the listing.

Dec 5, 2012

Major areas of data mining

Data Mining is a big field now. Each year, several major international conferences are held on the methodology alone (KDD, ICDM, SDM etc).  This does not include specialized conferences on text mining, video mining and so on. On top of these academic conferences, there are several major industry conferences such as Strata and Predictive Analytics World.

We can divide data mining fields based on algorithms or methods:

  • Association Rule Mining (including frequent pattern mining)
  •  Examples are the "beer and diaper" story on shopping in Walmart. It also applies to retail store product selection.
  • Machine Learning  
        o   Supervised  (“predictive modeling”)
§  Neural Network
§  Decision Tree
§  Logistic Regression
§  Support Vector Machine
o   Unsupervised ("clustering")
§  K-means
§  Expectation Maximization (EM)

  • Time series analysis
  • Recommender systems
The areas of data mining can also be separated by data format:
  • Numeric or categorical data – so called “structured” data
  • Text data – Text Mining and Natural language processing
  • Audio data
  • Image data
  • Video data

There are two types of data people work with:

  • Static Data – retrieved from database
  • Stream data
Stream data mining is addressing the problem of continuous incoming data, and very little memory to store all the data. Sometimes the system has to make real-time response.

I hope to get deeper into these areas in future blogs. 

Dec 4, 2012

Mining Data Streams

Tonight SVForum (Silicon Valley Forum) and its Business Intelligence SIG will present a talk on stream data computing. If I found some interesting content there, I will report it on this blog.

Stream data are fascinating, as they are very different from traditional data we normally get from database. Examples of stream data are: data sent by wireless sensors, RFID data, heart rate monitoring, and network traffic. Such data are large volume and they come in constantly. Thus there is inherent time component in stream data.

Typical stream data mining tried to find "frequent sequential patterns". One of such patterns is called Motif. A motif is a data sub-sequence that repeats throughout.  When we apply motif discovery to heart monitoring data, we get the following:

The green ones are repeated throughout, and blue ones are also motif. The following is two dance sub-sequences that are similar to each other. They correspond to the Jackson move.

For more in-depth discussion on motif discover, check out these two papers:
1. Mueen, Abdullah, and Eamonn Keogh. "Online discovery and maintenance of time series motifs." KDD 2010
2. Jingjing Meng (and others) "Mining Motifs from Human Motions", EuroGraphics 2008

Dec 3, 2012

How GetJar Recommends Mobile Apps

GetJar is a young company based in San Mateo, California. They make recommendation to mobile users on what applications to download. Today GetJar has more than 75 million users. TechCrunch had an article on this fast growing startup.

The problem of Get Jar is very different from Netflix in that the user feedback is “implicit”. In the original Netflix problem, the feedback is user ratings for movies. Such feedback is called “explicit”. In certain applications such as Yelp review or Expedia travel, users give ratings to restaurants or hotels. In these cases, recommender system worked for Netflix would also work.

However, in many other cases a company only observes implicit feedback. A retails store only observes user purchases. An online search or online game company only observes click behavior or playing behavior. A more practical recommender system should be able to use such implicit feedback to make recommendation

For GetJar, the feedback data are app usage. The feedback can be encoded as binary (1 or 0). If a user has used an app, then that user’s entry for an app is 1, otherwise it is 0. (GetJar also tried to use more information such as the number of days using an app, but found no difference in recommendation results)
For more technical detail on GetJar’s recommender system, please see paper "GetJar Mobile Application Recommendations with Very Sparse Datasets" by Kent Shi and Kamal Ali  (from GetJar),  presented in KDD 2012 conference.

If you are interested in other recommender systems presented at KDD 2012, Urban Mining has a nice collection of them.  

Dec 2, 2012

How Netflix Makes Recommendation for "Movies you will love"

In late October, Xavier Amatriain from Netflix gave a talk at the ACM bay area chapter data mining SIG monthly meeting. He touched on interested algorithms used at Netflix for recommending movies to the users.

The two core algorithms used are restricted Boltzmann Machine (RBM) and Singular Value Decomposition (SVD). Restricted Boltzmann Machine is a type of neural network with 1-hidden layer (See an illustration of below). In Netflix case, one RBM is built for every user. The input node for each RBM is a movie that user has rated. 

Singular Value Decomposition is a special method of decomposing a matrix into 2 or 3 matrices (depending on whether there is a diagonal matrix). In Netflix case, it decomposes the original user-movie rating matrix (where row is users, and columns correspond to movies) into 2 sub-matrices, one is a user factor matrix and one contains movie factors. This is useful when there are many missing values in the original matrix, which is the case when a user only rate a small amount of movies. The decomposed new matrices are smaller and have all cells filled with non-zero values.

Both methods with their special application for Netflix data were originally proposed in the paper titled "Restricted Boltzmann Machines for Collaborative Filtering" by Ruslan Salakhutdinov and his colleagues, presented in the International Conference of Machine Learning (ICML) in 2007. This papers presented some scalable computing methods, and thus was adopted by Netflix.

The full slides of Xavier Amatriain's talk is available here.

Dec 1, 2012

A wish for recommender system at Expedia

I enjoy using Expedia. Its interface is clean and the tools are very easy to use. There is no glitch in any transaction. Thus, for the past year or so I have become a loyal Expedia customer: Reserving almost all my travels on Expedia website.

Last week I was curious about possible vacation deals Expedia may have, wondering if Expedia can recommend some good one for me. After quite a few bookings, I assume Expedia already know what places I am interested in going, or what vacation I would enjoy. Maybe I am spoiled by Amazon and Netflix, who always knows what product I "may buy together"  or what movie I "may be interested". Here is the screenshot I got (shown to the left here) from Expedia. As I look at it, I wonder: Is this what I am supposed to be interested? Or is it just generic ads Expedia display to every customer? Austin is certainly not my favored destination, and the deals are not that interesting (I found out it is indeed a generic page after I logged out and got on the same page).

Suppose Expedia has a more targeted display. Using recommender system, Expedia can show destination that is of special interest to me. How would Expedia know? it can look at other users similar to me, and look at those people's favorite destination for holiday seasons. How do we define similarity of 2 users? There are many ways Expedia can go about doing this, including using the following information: past bookings, click behavior, ratings and comments the users gave to destinations.

All the above information can be captured and sorted together with all the users. Each user thus has a signature "profile". From there, the system can extract user similarity, product (i.e. vacation package) similarity, and then apply even fancier recommender system techniques (such as matrix decomposition) to make personalized recommendation.

When a website makes personalized recommendation, it is more engaging and makes more sales. I  hope to see my "personalized" vacation package on Expedia soon.