Pages

Apr 22, 2014

Product attribute extraction

One of the most popular applications of named entity detection is product attribute extraction. Automatically extracting product attributes from text helps e-commerce companies correctly process user queries and match them to the right products. Marketers want to know what people are saying about their products. Corporate strategists want to find out which products are trendy based on user discussions.

Product attributes are specific properties of a product. For example, a black Apple iPhone 5 has 4 major attributes: company name “Apple”, brand name “iPhone”, generation “5”, and color “black”. A complete set of attributes defines a product.

What challenges do we face in extracting product attributes? We can create a dictionary of products and their corresponding attributes, but simply relying on a dictionary has its limitations:

  • A dictionary is never complete, since new products are created all the time. 
  • There is ambiguity in matching the attributes. 
  • Misspellings, abbreviations, and acronyms are used often, particularly in social media such as Twitter. 

Let’s look at a case study from eBay (the results were published) and see how they solved this problem. eBay is an online marketplace for sellers and buyers. Small sellers create listings to be sold on eBay. With more than 100 million listings on the site, sellers offer everything from digital cameras, clothes, and cars to collectibles. To help users quickly find an item they are interested in, eBay’s product search engine has to quickly match a user query to existing listings. It is crucial that eBay group existing listings into products so that the search can be done quickly. This requires automatically extracting product attributes from each listing. However, they face the following challenges:

  • Each listing is less than 55 characters long and contains little context. 
  • Text is ungrammatical, with many nouns piled together.
  • Misspellings, abbreviations, and acronyms occur frequently.

As we can see, a dictionary-based approach does not work here, due to the large volume of new listings and products on the site and the high variation in how the same product attribute is stated. For example, the brand “River Island” appears in listings in forms such as: 
river islands, river islanfd, river islan, ?river island, riverislandtop. 

How do we apply machine learning to this problem? In a supervised learning approach, a small set of listings is labeled by humans. In each listing, every word is tagged either with an attribute name or with “Other”. Different product categories have different attributes. For example, in the clothing category there are 4 major product attributes: brand, garment type, color, and size. The following is an example of a labeled listing:

                    Ann Taylor   Asymmetrical Knit Dress   NWT     Red     Petite   Size S
                    Brand        Garment type              Other   Color   Size     Size
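As a minimal sketch of how such a labeled listing can be represented in code (the tag names and the small helper below are illustrative, not eBay's actual implementation):

    # A labeled listing represented as (token, tag) pairs; tag names are illustrative.
    labeled_listing = [
        ("Ann", "Brand"), ("Taylor", "Brand"),
        ("Asymmetrical", "Garment_type"), ("Knit", "Garment_type"), ("Dress", "Garment_type"),
        ("NWT", "Other"), ("Red", "Color"),
        ("Petite", "Size"), ("Size", "Size"), ("S", "Size"),
    ]

    # Collect the tokens for each attribute tag (ignoring "Other") to recover attribute values.
    def extract_attributes(tagged_tokens):
        attributes = {}
        for token, tag in tagged_tokens:
            if tag != "Other":
                attributes.setdefault(tag, []).append(token)
        return {tag: " ".join(tokens) for tag, tokens in attributes.items()}

    print(extract_attributes(labeled_listing))
    # {'Brand': 'Ann Taylor', 'Garment_type': 'Asymmetrical Knit Dress', 'Color': 'Red', 'Size': 'Petite Size S'}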

But supervised learning is very expensive, as it requires a lot of labeled data. Our goal is to use a small amount of labeled data and derive the rest through bootstrapping. This is called semi-supervised learning. For details of our semi-supervised learning approach, please see the paper in the reference. 

Reference:
Duangmanee Putthividhya and Junling Hu, "Bootstrapped Named Entity Recognition for Product Attribute Extraction". EMNLP 2011: 1557-1567

Jan 20, 2014

Statistics and Data Mining

Probability theory is at the core of data mining. In the early days of data mining, people counted frequent patterns such as “those who buy diapers also buy beer.” This is essentially deriving the conditional probability of buying beer given buying diapers. In mathematical terms, it is P(buy beer | buy diaper).

Here is an example: suppose 5% of shoppers buy diapers, and 2% of shoppers buy both diapers and beer. What is the likelihood that a shopper who buys diapers also buys beer? The answer is a conditional probability: P(buy beer | buy diaper) = 0.02 / 0.05 = 40%.
The so-called “association rules” in data mining can be understood much more clearly through these statistical terms. 
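As a small illustrative sketch in Python (the baskets below are made up), the same conditional probability can be estimated directly from raw transaction data:

    # Toy transactions; each basket is the set of items one shopper bought.
    baskets = [
        {"diaper", "beer", "milk"},
        {"diaper", "bread"},
        {"beer", "chips"},
        {"diaper", "beer"},
        {"milk", "bread"},
    ]

    n_diaper = sum(1 for b in baskets if "diaper" in b)
    n_both = sum(1 for b in baskets if {"diaper", "beer"} <= b)

    # P(buy beer | buy diaper) = P(diaper and beer) / P(diaper)
    print("P(beer | diaper) =", n_both / n_diaper)   # 2/3 for this toy data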

In machine learning, the core concept is predicting whether a data sample belongs to a certain population (class). For example, we want to predict whether a person is a potential buyer, or whether a credit card transaction is fraudulent. Such a prediction is typically associated with a confidence score between 0 and 1. When the confidence score is 1, we are 100% sure about the prediction. Generating such a confidence score requires statistical inference.

Statistics is essential when evaluating a data mining model. We use a control group or conduct an A/B test. For example, if we test with 1,000 people in a treatment group (Group A) vs. 1,000 people in a control group (Group B) and find that Group A performs better, is this result conclusive? In other words, is it statistically significant? That question has to be answered with statistical knowledge.
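As a hedged sketch of how this question might be checked in code, assuming a simple two-proportion z-test (the conversion counts below are made up):

    from math import sqrt
    from scipy.stats import norm

    # Hypothetical A/B test: conversions out of 1,000 users in each group.
    conv_a, n_a = 120, 1000   # treatment group (Group A)
    conv_b, n_b = 100, 1000   # control group (Group B)

    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)

    # Standard two-proportion z-test.
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = 2 * norm.sf(abs(z))   # two-sided p-value

    print(f"z = {z:.2f}, p-value = {p_value:.3f}")
    # If the p-value is below 0.05, the difference is usually called statistically significant.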

In the data collection phase, statistics helps us decide the size of the training data (the sample) and whether it is representative. Not to mention that some currently popular machine learning methods, such as logistic regression, were originally developed in the statistics community.

Just as mathematics provides a foundation for physics, statistics has become a foundation for machine learning. Its importance in data mining will only become more prominent over time.

Jan 3, 2014

Data Mining Conferences in 2014

A new year has come, and exciting data mining conferences are lining up on the horizon. Here is a list of major conferences in this field. They are good places to socialize with other data science researchers and practitioners, and to connect with potential candidates or future employers.

Last year, I missed KDD 2013. This year, that conference is at the top of my list. Here is a list of all the major ones.

Academic+industry

  • SDM (SIAM Conference on Data Mining), April 24-26, Philadelphia
  • KDD (Knowledge Discovery and Data Mining), Aug 24-27, New York City

 Industry

  • Strata (O'Reilly conference), Feb 11-13, Santa Clara, California
  • Predictive Analytics World, March 16-21, San Francisco, California

Academic

  • ICML (machine learning), June 21-26, Beijing, China
  • AAAI (Artificial Intelligence), July 27-31, Quebec City, Canada
  • ICDM (International Conference on Data Mining), Dec 14-17, Shenzhen, China

Specialized areas
  • WWW (web data, text mining), May 7-11, Seoul, Korea
  • ACL (Natural Language Processing), June 23-25, Baltimore
  • SIGIR (text mining), July 6-11, Gold Coast, Australia
  • Interspeech (speech mining), Sept 14-18, Singapore
  • RecSys (recommender systems), Oct 6-10, Foster City

Jan 2, 2014

Hiring data scientists

Many times I am asked by friends and colleagues who data scientists are. Many are interested in the answer to a very practical question: “Who should I hire as a data scientist?”

In my practical experience in building data science teams, I have come to appreciate the following qualities:
  1. A fundamental understanding of machine learning. Ultimately data mining cannot exist without machine learning, which provides its core techniques. Thus a researcher in machine learning or a related field (such as natural language processing, computer vision, artificial intelligence, or bioinformatics) is an ideal candidate. They have studied different machine learning methods and know the newest and best techniques to apply to a problem.
  2. A sophisticated understanding of statistics and advanced mathematics. Such understanding requires years of training. Thus a Ph.D. degree is typically required for data scientists.
  3. Training in computer science. Ultimately, mining data is a way of computing. It requires designing computer algorithms that are efficient in memory (space) and time. People trained in computer science understand the tradeoff between space and time in computing. They understand the basic concepts of computational complexity. Someone who has majored in computer science would have this training ingrained in their DNA. 
  4. Good coding skills. We live in a big data era. To work with data, we write code to process, clean, and transform it. Then we need to create programs on big data platforms, and test and improve those programs constantly. All of this requires good coding skills. Data mining is about implementation and testing. Programming skill is thus a core requirement.
In hiring a data scientist, a few other qualifications are desirable but not required:
  1. Experience with big data. This enables someone to work in certain environments such as Hadoop and pick up the tools quickly. But such knowledge can be learned easily. 
  2. Knowledge of a specific programming language. A good programmer can learn any new language quickly. In addition, there are many language options for big data programs, from Python to Java to Scala. If a person masters any one of these languages, he or she can be very productive. 
A good data scientist who satisfies the 4 basic-skill requirements is hard to find today. Even though our universities train tens of thousands of them each year, the market demand is far higher than that. Many people have read the report by McKinsey, which states that there will be a 140,000-job gap (demand exceeding the supply of talent) for data scientists by 2018.

Even today, in early 2014, companies are struggling to bring in data scientists. Those who are on the job market are immediately snatched up by large, well-known companies. Every company is trying to implement a “data strategy” (or a “big data strategy,” to use the fancier term). This is a golden age for data scientists but a challenging time for employers.

Nov 11, 2013

Overview of data mining

When people talk about data mining, sometimes they refer to the methods used in this field, such as machine learning. Sometimes they refer to specific data of interest, as in text mining or video mining. Many other times, people use the term "big data" to refer simply to the infrastructure of data mining, such as Hadoop or Cassandra. 

Given all of these various references, newcomers to the data mining field can feel lost in a wonderland. 
Here we give an overview picture that connects all these components together. Essentially, we can think of the data mining field as consisting of 4 major layers. At the top layer is the basic methodology, such as machine learning or frequent pattern mining. The second layer is the application to various data types: for social networks, this is graph mining; for sensor and mobile data, this is stream data mining. The third layer is the infrastructure, where Hadoop, NoSQL, and other environments are built to support large-scale data movement and retrieval. Finally, at the fourth layer, we are concerned with building a data mining team and understanding the people profiles that support all of the above operations. 


Given this four-layer separation, we can easily see where various discussions of data mining fall in the picture. For example, the hot topic of "deep learning" belongs to the machine learning layer; more specifically, it is often associated with unsupervised learning. Another topic, "Natural Language Processing" (NLP), is part of mining text data. Not surprisingly, this field uses machine learning extensively as its core methodology. 

Sep 12, 2013

How casinos are betting on big data

As reported by Yahoo! Finance today, casinos like Caesars crunch big data to find ways to attract gamblers. They can find out if a gambler is losing or winning too much:  
"They could win a lot or they lose a lot or they could have something in the middle. … So we do try to make sure that people don't have really unfortunate visits," said Caesars CEO Gary Loveman. 

Caesars has been leading data mining in the gambling industry. They have more than 200 data experts (including some real data scientists?) in house to crunch data on their loyalty program, VIP member patterns, and so on. That work was reported at the KDD 2011 industry practice expo.

Next time when you visit a casino, expect a suddenly friendlier slot machine after you are on a losing streak ...

The complete Yahoo! Finance report is here

Sep 4, 2013

Machine learning as a service

With data widely available and machine learning methods maturing, commercial application of machine learning has become a reality. A new type of business has sprung up: providing machine learning services. It turns out this service fills a big market void. About a dozen companies have grown rapidly in this space in the last few years, and many more startups are being formed this year.

Let me take a look at a few successful companies in this space:

1. Mu Sigma
The company derives its name from “mu” and “sigma” in statistics. Based in Chicago but with significant operations in India, Mu Sigma provides analytics for marketing, supply chain, and risk management. The company was founded in 2004 and has grown to 2,500 employees. Today, the company is valued at $1 billion.
 
2. Opera Solutions
Opera Solutions was founded in 2004, headquartered in New York with significant operations in San Diego. The company gathered a group of strong data scientists to work on projects ranging from financial services and health care to insurance and manufacturing. The company's scientists have participated in important data mining contests such as the KDD Cup and the Netflix contest, demonstrating strong technical and research skills.
Opera Solutions works with large to mid-sized companies and has been growing rapidly. It raised $84 million in funding in late 2011, and another $30 million in May 2013. Today the company is valued at $500 million.

3. Actian
Based in Redwood City, CA, Actian provides a data analytics platform. They recently acquired another data company, ParAccel.

4. Fractal Analytics
The company was founded in 2000 and is headquartered in San Mateo, with a significant operations presence in India. It provides business intelligence and analytics services for the financial services, insurance, and telecommunications industries.

5. Grok
They provide asset (equipment) management and predictive modeling for manufacturers. The company was founded in 2005 and is based in Redwood City.

6. Alpine Data Labs
They offer in-house consulting and 24-hour initial delivery. The company raised $7.5 million in Series A funding in May 2011 and is based in San Mateo.

7. Skytree
The company focuses specifically on machine learning. It raised $18 million in Series A funding in April 2013, after starting in early 2012 with $1.5 million in seed funding. The company is based in San Jose.

Jul 26, 2013

Learning to rank and recommender systems

A recommendation problem is essentially a ranking problem: Among a list of movies, which should rank higher in order to be recommended? Among the job candidates, who should LinkedIn display to the recruiters?  The task of recommendation can be viewed as creating a ranked list.

The classical approach to recommender systems is based on collaborative filtering, an approach that uses similar users or similar items to make recommendations. Collaborative filtering was popularized by the Netflix contest from 2006 to 2009, when many teams around the world participated to create movie recommendations based on movie ratings provided by Netflix. 
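As a minimal sketch of the item-based flavor of collaborative filtering (the rating matrix below is made up):

    import numpy as np

    # Rows = users, columns = movies; 0 means "not rated" in this toy matrix.
    ratings = np.array([
        [5, 4, 0, 1],
        [4, 5, 1, 0],
        [1, 0, 5, 4],
        [0, 1, 4, 5],
    ], dtype=float)

    def cosine_sim(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

    # Item-item similarity: movies rated similarly by the same users are "similar".
    n_items = ratings.shape[1]
    sim = np.array([[cosine_sim(ratings[:, i], ratings[:, j])
                     for j in range(n_items)] for i in range(n_items)])

    # Recommend for user 0: score unrated items by their similarity to the items the user rated.
    user = ratings[0]
    scores = sim @ user
    scores[user > 0] = -np.inf        # do not re-recommend already-rated items
    print("recommend item", int(np.argmax(scores)))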

While collaborative filtering has achieved some success, it has its limitations. The fundamental problem is the limited information captured in the user-item table. Each cell of this table is either a rating or some aggregated activity score (such as a purchase) on an item from a specific user. Complex information such as user browsing time, clicks, or external events is hard to capture in such a table format.

A ranking approach to recommendation is much more flexible. It can incorporate all of this information as different variables (features), so it is more explicit. In addition, we can combine ranking with machine learning, allowing the ranking function to evolve over time based on data. 

In the traditional approach to ranking, a ranking score is generated by fixed rules. For example, a page's score depends on the links pointing to that page, its text content, and its relevance to the search keywords. Other information, such as the visitor's location or the time of day, could all be part of the ranking formula. In this formula, the variables and their weights are pre-defined.

The idea of learning to rank is to use actual user data to create a ranking function. The machine learning procedure for ranking has the following steps (a small code sketch follows the list):
  1. Gather training data based on click information.
  2. Gather all attributes about each data point, such as item information, user information, time of day etc. 
  3. Create a training dataset that has 2 classes: positive (Click) and negative (no click).
  4. Apply a supervised machine learning algorithm (such as logistic regression) to the training data.
  5. The learned model is our ranking model. 
  6. For any new data point, the ranking model assigns a probability score between 0 and 1 on whether the item will be clicked (selected). We call this probability score our “ranking score”. 
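A minimal sketch of steps 3 through 6, assuming scikit-learn and a made-up feature table where each row is one displayed item and the label says whether it was clicked:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Made-up training data: each row is [item popularity, user-item match, hour of day],
    # and y is 1 if the displayed item was clicked, 0 otherwise.
    X = np.array([
        [0.9, 0.8, 20], [0.7, 0.9, 21], [0.2, 0.1, 9],
        [0.4, 0.3, 14], [0.8, 0.7, 22], [0.1, 0.2, 10],
    ])
    y = np.array([1, 1, 0, 0, 1, 0])

    model = LogisticRegression().fit(X, y)

    # For new (user, item) pairs, the predicted click probability is the ranking score.
    candidates = np.array([[0.6, 0.9, 20], [0.3, 0.2, 20]])
    scores = model.predict_proba(candidates)[:, 1]
    order = np.argsort(-scores)       # candidate indices, highest ranking score first
    print("ranking scores:", scores, "order:", order)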

The training data are constructed from user visit logs containing user clicks, or they may be prepared manually by human raters.

The learning to rank approach to recommendation has been adopted by Netflix and LinkedIn today. It is fast and can be trained repeatedly. It is behind those wonderful movie recommendations and connection recommendations we enjoy on these sites. 

Jul 9, 2013

Text Mining: Named Entity Detection (2)

In the machine learning approach for detecting named entities, we first create labeled data, which are sentences with each word tagged. The tags correspond to the entities of interest. For example,
May        Smith      was   born   in   May.
Person     Person                       Date
Since an entity may span several words, we can use finer tags in the IOB (inside-outside-beginning) format, which indicates whether a word is the beginning (B) of a phrase or inside (I) it. We will have tags for B-Person, I-Person, B-Date, I-Date, and O (other). With such tagging, the above example becomes
May        Smith      was   born   in   May.
B-Person   I-Person   O     O      O    B-Date

Our training data have every word tagged. Our goal is to learn this mapping and apply it to new sentences. In other words, we want to find a mapping from a word (given its context) to an entity label.

Each word can be represented as a set of features. A machine learning model maps input features to an output (entity class). Such features could include:
  • Word identity (the word itself)
  • Word position from the beginning
  • Word position from the end
  • The word before
  • The word after
  • Whether the word is capitalized
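A minimal sketch of such a feature extractor (the exact feature set here is illustrative):

    def word_features(sentence, i):
        """Represent the i-th word of a tokenized sentence as a feature dictionary."""
        word = sentence[i]
        return {
            "word": word,
            "position_from_start": i,
            "position_from_end": len(sentence) - 1 - i,
            "word_before": sentence[i - 1] if i > 0 else "/",
            "word_after": sentence[i + 1] if i < len(sentence) - 1 else "/",
            "is_capitalized": word[0].isupper(),
        }

    sentence = ["May", "Smith", "was", "born", "in", "May"]
    print(word_features(sentence, 0))
    # {'word': 'May', 'position_from_start': 0, 'position_from_end': 5, 'word_before': '/', 'word_after': 'Smith', 'is_capitalized': True}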

Our training data would look like the following, where each row is a data point.
Word identity   Position from beginning   Position to end   Word before   Word after   Is capitalized?   Class
‘May’           0                         5                 /             ‘Smith’      yes               B-Person
‘Smith’         1                         4                 ‘May’         ‘was’        yes               I-Person
‘May’           5                         0                 ‘in’          /             yes               B-Date

Once we create training data like the table above, we can train a standard machine learning method such as an SVM, a decision tree, or logistic regression to generate a model (which contains machine-generated rules). We can then apply this model to any new sentence and detect entities of the types "Person" and "Date".
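A minimal sketch of this training step, assuming scikit-learn and feature dictionaries like the rows in the table above:

    from sklearn.feature_extraction import DictVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # One feature dictionary and one label per word (mirroring the table above).
    X = [
        {"word": "May",   "pos_start": 0, "pos_end": 5, "before": "/",     "after": "Smith", "cap": True},
        {"word": "Smith", "pos_start": 1, "pos_end": 4, "before": "May",   "after": "was",   "cap": True},
        {"word": "was",   "pos_start": 2, "pos_end": 3, "before": "Smith", "after": "born",  "cap": False},
        {"word": "born",  "pos_start": 3, "pos_end": 2, "before": "was",   "after": "in",    "cap": False},
        {"word": "in",    "pos_start": 4, "pos_end": 1, "before": "born",  "after": "May",   "cap": False},
        {"word": "May",   "pos_start": 5, "pos_end": 0, "before": "in",    "after": "/",     "cap": True},
    ]
    y = ["B-Person", "I-Person", "O", "O", "O", "B-Date"]

    # DictVectorizer one-hot encodes the string features; logistic regression learns the mapping.
    model = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
    model.fit(X, y)

    new_word = {"word": "John", "pos_start": 0, "pos_end": 4, "before": "/", "after": "Lee", "cap": True}
    print(model.predict([new_word]))   # with so little toy data, the prediction is only illustrative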

Note that we can only detect entity types contained in our training data. While this may seem like a limitation, it still allows us to discover new instances or values associated with existing entity types.

The approach we have discussed so far is a supervised learning approach, which depends heavily on human-labeled data. When manual labels are hard to come by, we can find ways to create training data by machine. Such an approach is called semi-supervised learning, which holds a lot of promise as data get large.

Jul 8, 2013

Text Mining: Named Entity Detection

An interesting task in text mining is detecting entities in text. Such an entity could be a person, a company, a product, or a location. Since an entity is associated with a specific name, it is also called a named entity. For example, the following text contains 3 named entities:
       Apple has hired Paul Deneve as vice president, reporting to CEO Tim Cook.
The first term “Apple” indicates a company, and the second and third are persons.

Named entity detection (NER) is an important component of social media analysis. It helps us understand user sentiment about specific products. NER is also important for product search at e-commerce companies, where it helps us understand user search queries related to certain products.

To map each name to an entity, one solution is to use a dictionary of special names. Unfortunately, this approach has two serious problems. The first is that our dictionary is never complete: new companies are created and new products are sold every day, and it is hard to keep track of all the new names. The second is the ambiguity of associating a name with an entity. The following example illustrates this:
As Washington politicians argue about the budget reform, it is a good time to look back at George Washington’s time.
In this text, the first mention of “Washington” refers to a city, while the second refers to a person. The distinction between these two entities comes from their context.

To resolve ambiguity in entity mapping,  we can create certain rules to utilize the context. For example, we can create the following rules:
  1. When ‘Washington’ is followed by ‘politician’, then it refers to a city.
  2. When ‘Washington’ is preceded by ‘in’, then it refers to a city.
  3. When ‘Washington’ is preceded by ‘George’, then it refers to a person.
But there could be too many such rules. For example, each of the following phrases would generate a different rule: “Washington mentality”, “Washington atmosphere”, and “Washington debate”, as well as “Washington biography” and “Washington example”. The richness of natural language makes the number of rules explode, and the rules are still susceptible to exceptions.
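A minimal sketch of what such hand-written rules look like in code, and why they multiply quickly (the labels and word lists are illustrative):

    # Hand-written disambiguation rules for the single name "Washington".
    def classify_washington(prev_word, next_word):
        if next_word in {"politicians", "politician", "mentality", "atmosphere", "debate"}:
            return "CITY"
        if prev_word == "in":
            return "CITY"
        if prev_word == "George":
            return "PERSON"
        return "UNKNOWN"   # every new phrase ("Washington biography", ...) needs yet another rule

    print(classify_washington("As", "politicians"))   # CITY
    print(classify_washington("George", "'s"))        # PERSON
    print(classify_washington("the", "example"))      # UNKNOWN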

Instead of manually creating rules, we can apply machine learning. The advantage of machine learning is that it creates patterns automatically from examples; no rule needs to be written manually by humans. The machine learning algorithm takes a set of training examples and churns out its own model (which is comparable to a set of rules). If we get new training data, we can re-train the machine learning algorithm and generate a new model quickly.

How does the machine learning approach work? I will discuss it in the next post. 

Apr 30, 2013

The future of video mining


IP cameras have become inexpensive and ubiquitous. For $149, you can get an IP camera to use at home. The camera saves its recordings in the cloud, so you never have to worry about buying memory cards. Dropcam offers this amazing service. The camera connects to your Wi-Fi network and streams real-time video to the server 24 hours a day. You can watch your home remotely from your smartphone or laptop.

Dropcam is a startup based in San Francisco. Founded in 2009, the company has raised 2 rounds of funding, including a $12 million Series B in June 2012. As a young startup, the company has seen rapid user growth. Dropcam CEO Greg Duffy said that Dropcam cameras now upload more video per day than YouTube.

What makes Dropcam different from other IP camera companies is its video mining capability. If you want to review your video from the last 7 days, you don't have to watch it from beginning to end. Dropcam automatically marks the video segments that contain motion, so you can jump to those segments right away.

In addition, Dropcam plans to implement face and figure detection so that it can offer more intelligent viewing options. Furthermore, the video viewing software could detect events such as a cat running, a dinner gathering, and so on. The potential of video mining is limitless. Given our limited time to review many hours (or days and weeks) of video and the continuing accumulation of daily recordings, the need for video mining will keep growing.

Apr 29, 2013

Stroke Prediction

Stroke is the third leading cause of death in the United States. It is also the principal cause of serious long-term disability. Stroke risk prediction can contribute significantly to its prevention and early treatment. Numerous medical studies and data analyses have been conducted to identify effective predictors of stroke.

Traditional studies adopted features (risk factors) that were verified by clinical trials or selected manually by medical experts. For example, one famous study by Lumley and others [1] built a 5-year stroke prediction model using a set of 16 manually selected features. However, these manually selected features could miss some important indicators. For example, past studies have shown that there are additional risk factors for stroke, such as creatinine level, time to walk 15 feet, and others.

The Framingham Study [2] surveyed a wide range of stroke risk factors including blood pressure, the use of anti-hypertensive therapy, diabetes mellitus, cigarette smoking, prior cardiovascular disease, and atrial fibrillation. With the large number of features in current medical datasets, it is a cumbersome task to identify and verify each risk factor manually. Machine learning algorithms can efficiently identify features highly related to stroke occurrence from this huge set of features. By doing so, they can improve the prediction accuracy of stroke risk, in addition to discovering new risk factors.

In a study by Khosla and others [3], a machine-learning-based predictive model was built on stroke data, and several feature selection methods were investigated. Their model was based on automatically selected features, and it outperformed existing stroke models. In addition, they were able to identify risk factors that had not been discovered by traditional approaches. The newly identified factors include:

  • Total medications
  • Any ECG abnormality
  • Min. ankle arm ratio
  • Maximal inflation level
  • Calculated 100-point score
  • General health
  • Mini-Mental score (35 point)

It’s exciting to see machine learning play a more important role in medicine and health management.

Reference:

[1]  T.  Lumley, R. A. Kronmal, M. Cushman, T. A. Manolio, and S. Goldstein. A stroke prediction score in the elderly: Validation and web-based application. Journal of Clinical Epidemiology, 55(2):129–136, February 2002.
[2] P. A. Wolf, R. B. D'Agostino, A. J. Belanger, and W. B. Kannel. Probability of stroke: A risk profile from the Framingham study. Stroke, 22:312-318, March 1991.
[3] Aditya Khosla, Yu Cao, Cliff Chiung-Yu Lin, Hsu-Kuang Chiu, Junling Hu, and Honglak Lee. "An integrated machine learning approach to stroke prediction." In Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 183-192. ACM, 2010.

Apr 26, 2013

History of machine learning

The development of machine learning is an integral part of the development of artificial intelligence. In the early days of AI, people were interested in building machines that mimic the human brain. The perceptron model was invented in 1957, and it generated an overly optimistic view of AI during the 1960s. After Marvin Minsky pointed out the limitations of this model in expressing complex functions, researchers stopped pursuing it for the next decade.

In the 1970s, the machine learning field was dormant, as expert systems became the mainstream approach in AI. The revival of machine learning came in the mid-1980s, when the decision tree model was invented and distributed as software. The model can be inspected by a human and is easy to explain. It is also very versatile and can adapt to widely different problems. It was also in the mid-1980s that multi-layer neural networks were invented. With enough hidden layers, a neural network can express any function, thus overcoming the limitation of the perceptron. This led to a revival of neural network research.


Both decision trees and neural networks saw wide use in financial applications such as loan approval, fraud detection, and portfolio management. They were also applied to a wide range of industrial processes and to post office automation (address recognition). 
Machine learning saw rapid growth in the 1990s, due to the invention of the World Wide Web and the large amounts of data gathered on the Internet. The fast pace of interaction on the Internet called for more automation and more adaptivity in computer systems. Around 1995, the SVM was proposed and was quickly adopted. SVM packages like libSVM and SVMlight made it a popular method to use. 

After 2000, logistic regression was rediscovered and redesigned for large-scale machine learning problems. In the ten years following 2003, logistic regression attracted a lot of research work and became a practical algorithm in many large-scale commercial systems, particularly at large Internet companies.

We have discussed the development of 4 major machine learning methods. Other methods were developed in parallel but see declining use today: Naive Bayes, Bayesian networks, and the maximum entropy classifier (used mostly in natural language processing). 

In addition to the individual methods, we have seen the invention of ensemble learning, where several classifiers are used together, and its wide adoption today. 

New machine learning methods are still being invented every day. For the newest developments, please check out the annual ICML (International Conference on Machine Learning) conference. 

Apr 25, 2013

Yelp and Big data


At the Big Data Gurus meetup yesterday, hosted at the Samsung R&D center in San Jose, Jimmy Retzlaff from Yelp gave a talk on big data at Yelp.

By the end of March 2013, Yelp had 36 million user reviews. These reviews cover everything from restaurants to hair salons and other local businesses. The number of reviews on the Yelp website has grown exponentially in the last few years.

Yelp also sees high traffic now. In January 2013, there were 100 million unique visitors to Yelp. The website records 2 terabytes of log data and another 2 terabytes of derived logs every day. While this data size is still small compared to eBay or LinkedIn, it calls for big data infrastructure and data mining methods.

Yelp uses MapReduce extensively and builds its infrastructure on Amazon cloud.

Yelp’s log data contain ad displays, user clicks, and so on. Data mining helps Yelp design its search system, show ads, and filter fake reviews. In addition, data mining enables products such as "review highlights" and "people who viewed this also viewed...".

Yelp is one example of a company starting to tackle big data and taking advantage of data mining to create better services.

Apr 24, 2013

Machine Learning for Anti-virus Software


Symantec is the largest anti-virus software vendor. It has 120 million subscribers, who visit 2 billion websites a day and generate 700 billion submissions. Given such a large amount of data, it is paramount that anti-virus software detect viruses quickly and accurately.

Anti-virus software was originally built manually. Security experts review each piece of malware and construct its “signature”. Each computer file is checked against these signatures. Given how rapidly malware changes and how many variations exist, there are not enough human experts to generate all the exact signatures. This gives rise to heuristic or generic signatures, which can handle more variations of the same file. However, new types of malware are created every day. Thus we need a more adaptive approach that identifies malware automatically (without the manual effort of creating signatures). This is where machine learning can help.

Computer viruses have come a long way. The first virus, “Creeper”, appeared in 1971. Then came Rabbit (or Wabbit). After that came computer worms like “Love Letter” and Nimda. Today's computer viruses are much more sophisticated; they evolve much faster and are constantly changing. Virus creation is now funded by organizations and some governments. There is a big incentive to steal users' financial information or companies' trade secrets. In addition, malware enables certain governments to conduct spying or potential cyber war against their targets.

Symantec uses about 500 features for its machine learning model. Feature values can be continuous or discrete. Such features include:
How did it arrive on this machine (through a browser, email, ...)?
When/where did it arrive?
How many other files are on this machine?
How many clean files are on this machine?
Is the file packed or obfuscated (mutated)?
Does it write or communicate?
How often does it run?
Who runs it?

Researchers at Symantec experimented with SVM, decision tree, and linear regression models.

In building a classifier, they are not simply optimizing accuracy or the true positive rate. They are also concerned with false positives, instances where benign software is classified as malware. Such false positive predictions can have a high cost for users. Balancing true positives against false positives leads to the ROC (Receiver Operating Characteristic) curve.

An ROC curve plots the trade-off between the true positive rate and the false positive rate. Each point on the curve corresponds to a cutoff we choose. Symantec uses the ROC curve to select a target operating point. Below is an illustration of the trade-off.

The chart above suggests that when we aim for a 90% true positive rate, we will have a 20% false positive rate. However, when we only aim for an 80% true positive rate, the false positive rate can be reduced further. (A better classifier would shift the ROC curve up, so that we achieve a higher true positive rate for any given false positive rate.)
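A minimal sketch of how an ROC curve and an operating point can be computed, assuming scikit-learn and made-up classifier scores (this is not Symantec's actual data or code):

    import numpy as np
    from sklearn.metrics import roc_curve

    # Made-up labels (1 = malware, 0 = benign) and classifier scores.
    y_true   = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
    y_scores = np.array([0.95, 0.9, 0.8, 0.75, 0.6, 0.85, 0.4, 0.3, 0.2, 0.1])

    fpr, tpr, thresholds = roc_curve(y_true, y_scores)

    # Pick the highest true positive rate whose false positive rate stays at or below 1%.
    ok = fpr <= 0.01
    best = np.argmax(tpr[ok])
    print("threshold:", thresholds[ok][best], "TPR:", tpr[ok][best], "FPR:", fpr[ok][best])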

According to their researchers, Symantec has achieved a high accuracy rate (the average of the true positive and true negative rates) of 95%. Its true positive rate is above 98% and its false positive rate is below 1%.

I am a user of Norton software (by Symantec) and enjoy it. I hope to see more success from Symantec, and I hope we are winning the war against malware!

Mar 12, 2013

Basic steps of applying machine learning methods


Deploying a machine learning model typically takes the following five steps (a small code sketch follows the list):

1. Data collection.
2. Data preprocessing:
    1) Data cleaning;
    2) Data transformation;
    3) Divide data into training and testing sets.
3. Build a model on training data.
4. Evaluate the model on the test data.
5. If the performance is satisfying, deploy to the real system.
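A minimal end-to-end sketch of steps 2 through 4, assuming scikit-learn and a synthetic dataset standing in for real collected data:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    # Step 1 (stand-in): synthetic data instead of real collected data.
    X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

    # Step 2: preprocess (here just scaling) and split into training and test sets.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
    scaler = StandardScaler().fit(X_train)
    X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

    # Step 3: build a model on the training data.
    model = LogisticRegression().fit(X_train, y_train)

    # Step 4: evaluate the model on the test data.
    print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))

    # Step 5: if the performance (and speed) is acceptable, retrain on all data and deploy.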

This process can be iterative, meaning we can re-start from step 1 again. For example, after a model is deployed, we can collect new data and repeat this process. Let’s look at the details of each step:

1. Data Collection:  
        At this stage, we want to collect all relevant data. For an online business, user clicks, search queries, and browsing information should all be captured and saved in a database.
In manufacturing, log data capture machine status and activities. Such data are used to produce maintenance schedules and predict required parts for replacement.

2. Data Preprocessing:
       1) Data cleaning: 
       The data used in machine learning describe factors, attributes, or features of an observation. Simple first steps in looking at the data include finding missing values. What is the significance of a missing value? Would replacing a missing data value with the median value for the feature be acceptable? For example, perhaps the person filling out a questionnaire doesn't want to reveal his salary. This could be because the person has a very low salary or a very high salary. In this case, using other features to predict the missing salary data might be appropriate; one might infer the salary from the person's zip code. The fact that the value is missing may itself be important. There are also machine learning methods that can ignore missing values, and one of these could be used for this data set.
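A minimal sketch of these missing-value choices, assuming pandas and a made-up questionnaire table:

    import pandas as pd

    # Made-up questionnaire data with a missing salary value.
    df = pd.DataFrame({
        "age":     [34, 45, 29, 52],
        "zipcode": ["94105", "10001", "94105", "60601"],
        "salary":  [85000, None, 72000, 110000],
    })

    # Option 1: fill the missing salary with the median of the column.
    df["salary_filled"] = df["salary"].fillna(df["salary"].median())

    # Option 2: keep a flag recording that the value was missing, since missingness may be informative.
    df["salary_missing"] = df["salary"].isna()

    print(df)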
          2) Data Transformation: 
          In general we work with both numerical and categorical data. Numerical data consist of actual numbers, while categorical data take a few discrete values. Examples of categorical data include eye color, species type, marital status, or gender. A zip code is actually categorical: it is a number, but there is no meaning in adding two zip codes. Categorical data may or may not have an order. For instance, good, better, best is descriptive categorical data that has an order.
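A minimal sketch of transforming categorical data into a numeric form a model can use (column names and values are made up):

    import pandas as pd

    df = pd.DataFrame({
        "eye_color": ["brown", "blue", "green", "blue"],   # unordered categorical
        "quality":   ["good", "best", "better", "good"],   # ordered categorical
        "height_cm": [170, 182, 165, 174],                 # numerical, left as-is
    })

    # Unordered categories become one-hot indicator columns.
    encoded = pd.get_dummies(df, columns=["eye_color"])

    # Ordered categories can be mapped to integers that preserve the order.
    encoded["quality"] = df["quality"].map({"good": 1, "better": 2, "best": 3})

    print(encoded)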

3) After the data have been cleaned and transformed, they need to be split into a training set and a test set.

3. Model Building:  
        This training data set is used to create the model which is used to predict the answers for new cases in which the answer or target is unknown.   For example, Section 1.3 describes how a decision tree is built using the training data set.  Several different modeling techniques have been introduced and will be discussed in detail in future sections.  Various models can be built using the same training data set.

4. Model Evaluation:
         Once the model is built with the training data, it is used to predict the targets for the test data. First the target values are removed from the test data set. The model is applied to the test data set to predict the target values, and each predicted value is then compared with the actual target value. The accuracy of the model is the percentage of correct predictions made. These accuracies can be used to compare the different models. Several other ways to compare model accuracy are discussed in the next section on Performance Evaluation.

5. Model Deployment:
        This is the most important step. If the speed and accuracy of the model are acceptable, the model should be deployed in the real system. The model used in production should be built with all the available data; models improve with the amount of data used to create them. The results of the model then need to be incorporated into the business strategy. Data mining models provide valuable information that gives companies great advantages. Obama won the election in part by incorporating data mining results into his campaign strategy. The last chapter of this book provides information on how a company can incorporate data mining results into its daily business.

Feb 21, 2013

Data Mining and Neuroscience


Bradley Voytek from the University of California, San Francisco gave a talk on data mining and neuroscience yesterday in Mountain View. The talk was part of the Big Data Think Tank meetup.

Voytek said data mining could play an important role in brain study and treatment. Imagine applying data mining to reduce open-skull surgery from 45 minutes to 2 minutes, or imagine a better way to analyze and understand fMRI data. Imagine applying data mining to understand aging and the brain's response, helping us identify ways to improve cognitive function for the elderly. In addition, though not mentioned in this talk, there are recent studies applying data mining (specifically association rule mining) to understand Alzheimer's disease, identifying associations in brain region changes.

Voytek also mentioned the need to understand the network of neural signals. This is an exciting domain as it will ultimately help us to improve human cognition, such as hearing and vision. As reported recently in the news, the implementation of “bionic eyes” depends on the mapping of neurons corresponding to visual processing. Deeper understanding of neurons for this function could help us eradicate blindness completely. Imagine a future when there is no more blindness or deafness. How much human suffering can be eliminated!

Neuroscience is the next frontier of science. Understanding the human brain and ultimately human consciousness would resolve age-old questions about the self and the soul. With that understanding, imagine if we could preserve consciousness or even human memory. It is not far-fetched to think we could watch another person's memory like watching a movie, if we could truly understand the neural signals associated with memory retrieval.

It is exciting to see that data mining can play an important role in the advancement of science, particularly in neuroscience.

Feb 11, 2013

Data Mining vs. Machine Learning

Data mining and machine learning used to be two cousins; they have different parents. Now they have grown increasingly alike, almost like twins. Many times people even call data mining by the name machine learning.

The field of machine learning grew out of the effort of building artificial intelligence. Its major concern is making a machine learn and adapt to new information. The origin of machine learning can be traced back to 1957, when the perceptron model was invented. This model was inspired by neurons in the human brain. It prompted the development of the neural network model, which flourished in the late 1980s. From the 1980s to the 1990s, the decision tree method became very popular, owing to the efficient C4.5 package. The SVM was invented in the mid-1990s and has since been widely used in industry. Logistic regression, an old method from statistics, saw growing adoption in machine learning after 2001, when the book The Elements of Statistical Learning was published.

The field of data mining grew out of knowledge discovery in databases. In 1993, a seminal paper by Rakesh Agrawal and two co-authors proposed an efficient algorithm for mining association rules in large databases. This paper prompted many research papers on discovering frequent patterns and more efficient mining algorithms. The early data mining work of the 1990s was linked to creating better SQL statements and working with databases directly.

Data mining has a strong focus on working on industrial problems and getting practical solutions. Therefore it is concerned not only with data size (large data) but also with data processing speed (streaming data). In addition, personalized recommender systems and network mining were developed to meet business needs, outside the machine learning field.  

The two major conferences for data mining are KDD (Knowledge Discovery and Data Mining) and ICDM (International Conference on Data Mining). The two major conferences for machine learning are ICML (International Conference on Machine Learning) and NIPS (Neural Information Processing Systems). Machine learning researchers attend both types of conferences. However, the data mining conferences have a much stronger industrial link.

Data miners typically have a strong foundation in machine learning, but they also have a keen interest in applying it to large-scale problems.

Over time, we will see deeper connection between data mining and machine learning. Could they become twins one day? Only time will tell. 

Feb 10, 2013

The future of genome data mining


23andMe is a startup based in Mountain View, California. Founded in 2006, its core business is genome sequencing for individuals, along with providing additional information on your ancestry and possible disease risks, which you can access on their website. 

The cost of sequencing a person's genome used to be prohibitive. However, 23andMe, with its deep pockets of venture and personal funding (co-founder Anne Wojcicki is the wife of Google co-founder Sergey Brin), was able to cut the price from $999 in 2007 to $399 in 2008, and then to $299 through the end of 2012. In December 2012, with $50 million in Series D funding, 23andMe slashed the price to $99 per person. Such a price is probably below their actual testing cost. Why the price cut? 23andMe states that their goal is to get 1 million people to participate.

What is the drive behind this large expansion of the user base? The first potential is disease discovery. With a large population, a disease can be linked to genome data more solidly. Suppose we find a gene mutation in 1 diabetes patient; that is not enough to conclude that the mutation caused her diabetes. However, if we find the same gene mutation in 1,000 diabetes patients, we can draw this conclusion with more confidence. Ultimately it is about getting a large enough population sample so that we can uniquely link a gene mutation or ancestral trait to a disease.

By December 2012 (before the price cut), 23andMe had accumulated 180,000 individual genome profiles [1]. So far, this is the largest dataset on human genomes that any one organization has accumulated. Combined with the self-reported health profiles of these customers, studies linking diseases to gene patterns can be done more conclusively.    

23andMe has partnered with Genentech to study a range of diseases from Alzheimer's to breast cancer, and (most recently) the drug Avastin. In addition, the company received a small grant from the NIH to study allergies and asthma. Given the large population of genome data, we could see some exciting discoveries.

Data mining will play a big role in these new discoveries. Not only does data mining enable pattern discovery in large data where there are many different diseases and personal traits, it can also create predictive models of disease onset based on a person's genome profile. Feature selection techniques from data mining have also worked well in genome studies, where there are more than 20,000 gene features but only a few data points. Even with 1 million people in the data, the problem of few data points can still exist when only a small group of people have similar diseases (thus it is important to get data from even more people, ideally tens of millions or even billions).

The future of genome study is closely linked to data mining. This is an exciting time to be a data miner.

Reference:
[1] 23andMe press release, “23andMe Raises More Than $50 Million in New Financing”, December 11, 2012. http://mediacenter.23andme.com/press-releases/23andme-raises-more-than-50-million-in-new-financing/