Apr 22, 2014

Product attribute extraction

One of the most popular applications of named entity detection is product attribute extraction. Automatically extracting product attributes from text helps e-commerce companies correctly process user queries and match them to the right products. Marketers want to know what people are saying about their products. Corporate strategists want to find out which products are trending based on user discussions.

Product attributes are specific properties of a product. For example, a black Apple iPhone 5 has four major attributes: company name “Apple”, brand name “iPhone”, generation “5”, and color “black”. A complete set of attributes defines a product.

What challenges do we face in extracting product attributes? We can create a dictionary of products and their corresponding attributes, but relying on a dictionary alone has its limitations:

  • A dictionary is never complete, because new products are constantly being created.
  • There is ambiguity in matching the attributes.
  • Misspellings, abbreviations, and acronyms are common, particularly in social media such as Twitter.

Let’s look at a case study from eBay (the results have been published) and see how they solved this problem. eBay is an online marketplace for sellers and buyers. Small sellers create listings for the items they sell on eBay. With more than 100 million listings on the site, sellers offer everything from digital cameras, clothes, and cars to collectibles. To help users quickly find an item they are interested in, eBay’s product search engine has to quickly match a user query to existing listings. It is crucial that eBay group existing listings into products so that the search can be done quickly. This requires automatically extracting product attributes from each listing. However, they face the following challenges:

  • Each listing is less than 55 characters long, so it contains little context.
  • The text is ungrammatical, with many nouns piled together.
  • Misspellings, abbreviations, and acronyms occur frequently.

As we can see, a dictionary-based approach does not work here because of the large volume of new listings and products on the site and the high variation in how the same product attribute is written. For example, the brand “River Island” appears in listings in the following forms:
river islands, river islanfd, river islan, ?river island, riverislandtop. 
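
To make the limitation concrete, here is a minimal sketch that compares an exact dictionary lookup with approximate string matching on the variants above. It uses Python's standard `difflib` module. The `normalize_brand` helper, the tiny brand dictionary, and the 0.8 cutoff are assumptions made for illustration, not the method eBay used.

```python
import difflib

KNOWN_BRANDS = ["river island", "ann taylor"]  # hypothetical brand dictionary

VARIANTS = ["river islands", "river islanfd", "river islan",
            "?river island", "riverislandtop"]


def normalize_brand(raw, known_brands=KNOWN_BRANDS, cutoff=0.8):
    """Return the closest known brand, or None if nothing is similar enough.

    The cutoff of 0.8 is an assumed threshold chosen for this example.
    """
    matches = difflib.get_close_matches(raw.lower(), known_brands, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Exact lookup: none of the variants is literally in the dictionary.
exact_hits = [v for v in VARIANTS if v in KNOWN_BRANDS]

# Approximate matching: map each variant to its closest dictionary entry (or None).
fuzzy_hits = {v: normalize_brand(v) for v in VARIANTS}

for variant, brand in fuzzy_hits.items():
    print(f"{variant!r} -> {brand!r}")
```

Approximate matching can recover simple misspellings like these, but it still depends on the brand already being in the dictionary and does nothing about ambiguity, which is why a learned tagger is attractive.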

How do we apply machine learning to this problem? In a supervised learning approach, a small set of listings is labeled by humans. In each listing, each word is tagged either with an attribute name or with “Other”. Different product categories have different attributes. For example, in the clothing category, there are four major product attributes: brand, garment type, color, and size. The following is an example of a labeled listing:

                    Ann Taylor   Asymmetrical   Knit Dress     NWT   Red     Petite   Size S
                    Brand                       Garment type         Color   Size     Size

But supervised learning is very expensive because it requires a lot of labeled data. Our goal is to use a small amount of labeled data and derive the rest through bootstrapping. This is called semi-supervised learning. For details of our semi-supervised learning approach, please see the paper referenced below.

Duangmanee Putthividhya and Junling Hu, "Bootstrapped Named Entity Recognition for Product Attribute Extraction". EMNLP 2011: 1557-1567
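
As a rough illustration of the bootstrapping idea mentioned above, the toy sketch below grows two small dictionaries from a single seed each. It is not the algorithm from the paper. The two rules, the `bootstrap` function, and the sample listings are all assumptions invented for this example.

```python
def bootstrap(listings, seed_brands, seed_garments, rounds=3):
    """Grow brand and garment-type dictionaries from a few seeds.

    Toy rules (assumptions for this sketch only):
      1. If a listing starts with a known brand, the next token is a garment type.
      2. The tokens before the first known garment type form a brand.
    """
    brands, garments = set(seed_brands), set(seed_garments)
    for _ in range(rounds):
        for text in listings:
            tokens = text.lower().split()
            # Rule 1: known leading brand -> learn the garment type that follows it.
            for brand in list(brands):
                brand_tokens = brand.split()
                n = len(brand_tokens)
                if tokens[:n] == brand_tokens and len(tokens) > n:
                    garments.add(tokens[n])
            # Rule 2: known garment type -> learn the brand that precedes it.
            for i, token in enumerate(tokens):
                if i > 0 and token in garments:
                    brands.add(" ".join(tokens[:i]))
                    break
    return brands, garments


# Hypothetical unlabeled listings, invented for illustration.
UNLABELED = [
    "Ann Taylor dress red",
    "Ann Taylor blouse",
    "Zara blouse blue",
    "Zara skirt",
    "Mango skirt",
]

brands, garments = bootstrap(UNLABELED, seed_brands={"ann taylor"}, seed_garments={"dress"})
print(sorted(brands))
print(sorted(garments))
```

Starting from one brand and one garment type, each dictionary helps extend the other. A real system would replace these hand-written rules with a trained classifier and would need safeguards against errors propagating from round to round.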

Jan 20, 2014

Statistics and Data Mining

Probability theory is at the core of data mining. In the early days of data mining, people counted frequent patterns, such as the observation that shoppers who buy diapers also buy beer. This is essentially deriving a conditional probability: the probability of buying beer given buying diapers. In mathematical terms, it is P(buy beer | buy diaper).

Here is an example: suppose 5% of shoppers buy diapers, and only 2% of shoppers buy both diapers and beer. What is the likelihood that someone buys beer given that they buy diapers? The answer is a conditional probability: P(buy beer | buy diaper) = 2% / 5% = 40%.
The so-called “association rule” in data mining can be understood much more clearly through these statistical terms.
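
The same calculation can be sketched in a few lines of code. The percentages in the example determine P(buy beer | buy diaper); the reverse quantity, P(buy diaper | buy beer), would also need the share of shoppers who buy beer, which the example does not give. The total of 1000 shoppers and the `confidence` helper are assumptions made for illustration.

```python
def confidence(n_antecedent, n_both):
    """Confidence of the rule antecedent -> consequent, i.e. P(consequent | antecedent)."""
    if n_antecedent == 0:
        raise ValueError("antecedent never occurs; confidence is undefined")
    return n_both / n_antecedent


# Assumed counts: out of 1000 shoppers, 5% buy diapers and 2% buy both.
total_shoppers = 1000
diaper_buyers = 50
diaper_and_beer_buyers = 20

# Support of the rule: P(diaper and beer).
support = diaper_and_beer_buyers / total_shoppers
# Confidence of the rule "diaper -> beer": P(beer | diaper).
rule_confidence = confidence(diaper_buyers, diaper_and_beer_buyers)

print(f"support = {support:.2f}, confidence = {rule_confidence:.2f}")
```

In association rule terminology, the joint probability is the rule's support and the conditional probability is its confidence.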

In machine learning, the core concept is predicting whether a data sample belongs to a certain population (class). For example, we want to predict whether a person is a potential buyer, or whether a credit card transaction is fraudulent. Such a prediction is typically associated with a confidence score between 0 and 1. When the confidence score is 1, we are 100% sure about the prediction. Generating such a confidence score requires statistical inference.
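
One common way to produce such a score is a logistic model, which squashes a weighted sum of features into the range from 0 to 1. The sketch below is only an illustration: the weights, the bias, and the two features (say, site visits and items in the cart) are made-up values, not fitted to any data.

```python
import math


def sigmoid(score):
    """Map any real-valued score to a value between 0 and 1 (numerically stable form)."""
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    z = math.exp(score)
    return z / (1.0 + z)


def buyer_confidence(features, weights, bias):
    """Confidence that a person is a buyer under a simple logistic model."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(score)


# Hypothetical parameters; a real model would learn these from labeled data.
WEIGHTS = [0.8, 1.5]
BIAS = -3.0

low = buyer_confidence([1, 0], WEIGHTS, BIAS)   # one visit, empty cart
high = buyer_confidence([3, 2], WEIGHTS, BIAS)  # three visits, two items in cart
print(f"low-intent shopper: {low:.3f}, high-intent shopper: {high:.3f}")
```

A score near 0.5 means the model is undecided, while scores close to 0 or 1 indicate a confident prediction either way.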

Statistics is essential when evaluating a data mining model. We use a control group or conduct an A/B test. For example, if we test with 1000 people in the treatment group (Group A) vs. 1000 people in the control group (Group B) and find that Group A performs better, is this result conclusive? In other words, is it statistically significant? That question has to be answered with statistical knowledge.
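
One standard way to answer that question for conversion-style metrics is a two-proportion z-test. The sketch below uses only the standard library. The conversion counts (120 and 100 out of 1000) are hypothetical numbers chosen for illustration, and the function name is an assumption.

```python
import math


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Return (z, two-sided p-value) for the difference between two proportions."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # Pooled rate under the null hypothesis that both groups perform the same.
    p_pool = (successes_a + successes_b) / (n_a + n_b)
    std_err = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / std_err
    # Standard normal CDF evaluated at |z|, via the error function.
    cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    p_value = 2 * (1 - cdf)
    return z, p_value


# Hypothetical outcome: 120 of 1000 convert in Group A, 100 of 1000 in Group B.
z, p_value = two_proportion_z_test(120, 1000, 100, 1000)
print(f"z = {z:.2f}, p-value = {p_value:.3f}")
print("significant at the 5% level" if p_value < 0.05 else "not significant at the 5% level")
```

With these assumed counts, Group A looks two percentage points better, yet the p-value stays above 0.05, so the difference would not be considered conclusive at the conventional 5% level.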

In the data collection phase, statistics helps us decide the size of the training data (sample) and whether it is representative. Not to mention that some currently popular machine learning methods, such as logistic regression, were originally developed in the statistics community.

Just as mathematics provides the foundation for physics, statistics has now become a foundation for machine learning. Its importance in data mining will become more prominent over time.

Jan 3, 2014

Data Mining Conferences in 2014

A new year has come and exciting data mining conferences are lining up on the horizon. Here is a list of major conferences in this field. They are good places to socialize with other data science researchers and practitioners, to connect with potential candidates or future employers.

Last year, I missed KDD 2013. This year, it is at the top of my list.


  • SDM (SIAM Conference on Data Mining), April 24-26, Philadelphia
  • KDD (Knowledge Discovery and Data Mining), Aug 24-27, New York City


  • Strata (O'Reilly conference), Feb 11-13, Santa Clara, California
  • Predictive Analytics World, March 16-21, San Francisco, California


  • ICML (machine learning), June 21-26, Beijing, China
  • AAAI (artificial intelligence), July 27-31, Quebec City, Canada
  • ICDM (International Conference on Data Mining), Dec 14-17, Shenzhen, China

Specialized areas
  • WWW (web data, text mining), April 7-11, Seoul, Korea
  • ACL (natural language processing), June 23-25, Baltimore
  • SIGIR (text mining), July 6-11, Gold Coast, Australia
  • Interspeech (speech mining), Sept 14-18, Singapore
  • RecSys (recommender systems), Oct 6-10, Foster City

Jan 2, 2014

Hiring data scientists

Friends and colleagues often ask me who data scientists are. Many are interested in the answer to a very practical question: “Who should I hire as a data scientist?”

In my practical experience in building data science teams, I have come to appreciate the following qualities:
  1. A fundamental understanding of machine learning. Ultimately, data mining cannot exist without machine learning, which provides its core techniques. Thus a researcher in machine learning or a related field (such as natural language processing, computer vision, artificial intelligence, or bioinformatics) is an ideal candidate. Such researchers have studied different machine learning methods and know the newest and best techniques to apply to a problem.
  2. A sophisticated understanding of statistics and advanced mathematics. Such understanding requires years of training. Thus a Ph.D. degree is typically required for data scientists.
  3. Training in computer science. Ultimately, mining data is a form of computing. It requires designing computer algorithms that are efficient in memory (space) and time. People who are trained in computer science understand the tradeoff between space and time in computing. They understand the basic concepts of computational complexity. Someone who has majored in computer science would have this training ingrained in their DNA.
  4. Good coding skills. We live in a big data era. To work with data, we write code to process it, clean it, and transform it. Then we need to create programs on a big data platform, and test and improve those programs constantly. All of this requires good coding skills. Data mining is about implementation and testing. Programming skill is thus a core requirement.
In hiring a data scientist, a few other qualifications are desirable but not required:
  1. Experience with big data. This enables someone to work in certain environments, such as Hadoop, and to use the tools quickly. But such knowledge can be easily learned.
  2. Knowledge of a specific programming language. A good programmer can learn a new language quickly. In addition, there are many options for running big data programs, from Python to Java to Scala. A person who has mastered any one of these languages can be very productive.
A good data scientist who satisfies the four basic skill requirements is hard to find today. Even though our universities train tens of thousands of them each year, market demand is far higher than that. Many people have read the report by McKinsey, which states that there will be a gap of 140,000 jobs (demand exceeding talent supply) for data scientists by 2018.

Even today, in early 2014, companies are struggling to bring in data scientists. Those who are on the job market are immediately snatched up by large, well-known companies. Today, every company is trying to implement a “data strategy” (or a “big data strategy”, in its fancier form). This is a golden age for data scientists but a challenging time for employers.