Apr 30, 2013

The future of video mining

IP cameras have become inexpensive and ubiquitous. For $149, you can get an IP camera to use at home. The camera saves its recordings in the cloud, so you never have to worry about buying memory cards. Dropcam offers this amazing service. The camera connects to your Wi-Fi network and streams real-time video to the server 24 hours a day. You can watch your home remotely from your smartphone or laptop.

Dropcam is a startup based in San Francisco. Founded in 2009, the company has raised two rounds of funding, including a $12 million Series B in June 2012. As a young startup, the company has seen rapid user growth. Dropcam CEO Greg Duffy has said that Dropcam cameras now upload more video per day than YouTube.

What makes Dropcam unique among IP camera companies is its video mining capability. If you want to review your video from the last 7 days, you don't have to watch it from beginning to end. Dropcam automatically marks the video segments that contain motion, so you can jump to those segments right away.
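To get a feel for how motion-marked segments might work, here is a minimal sketch (my own illustration, not Dropcam's actual algorithm): compare each frame to the previous one and flag frames whose mean pixel difference exceeds a threshold.

```python
import numpy as np

def motion_segments(frames, threshold=10.0):
    """Return indices of frames whose mean absolute pixel difference
    from the previous frame exceeds `threshold`.
    `frames` is a list of 2-D grayscale arrays."""
    flagged = []
    for i in range(1, len(frames)):
        diff = np.abs(frames[i].astype(float) - frames[i - 1].astype(float))
        if diff.mean() > threshold:
            flagged.append(i)
    return flagged

# Toy example: three still frames, then a bright object enters the scene.
still = np.zeros((8, 8))
moving = still.copy()
moving[2:5, 2:5] = 255.0  # a bright blob appears
frames = [still, still, still, moving]
print(motion_segments(frames))  # -> [3]: only the frame with the blob is flagged
```

A real system would also merge consecutive flagged frames into segments and suppress noise (lighting changes, camera shake), but the core idea is the same.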

In addition, Dropcam plans to implement face and figure detection so that it can offer more intelligent viewing options. Furthermore, the viewing software could detect events such as a cat running, a dinner gathering, and so on. The potential of video mining is limitless. Given our limited time to review many hours (or days and weeks) of video, and the continuing accumulation of daily recordings, the need for video mining will keep growing.

Apr 29, 2013

Stroke Prediction

Stroke is the third leading cause of death in the United States. It is also the principal cause of serious long-term disability. Stroke risk prediction can contribute significantly to its prevention and early treatment. Numerous medical studies and data analyses have been conducted to identify effective predictors of stroke.

Traditional studies adopted features (risk factors) that were verified by clinical trials or selected manually by medical experts. For example, one famous study by Lumley and others [1] built a 5-year stroke prediction model using a set of 16 manually selected features. However, manually selected features can miss important indicators. For example, past studies have shown that there are additional risk factors for stroke, such as creatinine level, time to walk 15 feet, and others.

The Framingham Study [2] surveyed a wide range of stroke risk factors, including blood pressure, the use of anti-hypertensive therapy, diabetes mellitus, cigarette smoking, prior cardiovascular disease, and atrial fibrillation. With the large number of features in current medical datasets, it is a cumbersome task to identify and verify each risk factor manually. Machine learning algorithms can efficiently identify features highly related to stroke occurrence from a huge feature set. By doing so, they can improve the accuracy of stroke risk prediction, in addition to discovering new risk factors.

In a study by Khosla and others [3], a machine-learning-based predictive model was built on stroke data, and several feature selection methods were investigated. Their model, based on automatically selected features, outperformed existing stroke models. In addition, they were able to identify risk factors that had not been discovered by traditional approaches. The newly identified factors include:

Total medications
Any ECG abnormality
Min. ankle arm ratio
Maximal inflation level
Calculated 100 point score
General health
Mini-Mental score (35 point)
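To illustrate the idea behind automatic feature selection, here is a toy sketch (not the method in [3], which investigated several selection strategies, and with made-up feature values): rank candidate risk factors by the absolute correlation of each feature with the stroke label, and keep the top-ranked ones.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sx = sum((a - mx) ** 2 for a in xs) ** 0.5
    sy = sum((b - my) ** 2 for b in ys) ** 0.5
    return cov / (sx * sy)

def rank_features(columns, labels, top_k=2):
    """Rank features by |correlation| with the binary stroke label."""
    scores = {name: abs(pearson(vals, labels)) for name, vals in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Toy data for 8 patients; values are invented for illustration.
labels = [0, 0, 0, 0, 1, 1, 1, 1]  # 1 = stroke occurred
columns = {
    "creatinine": [0.1, 0.2, 0.0, 0.1, 0.9, 1.0, 0.8, 0.9],  # tracks the label
    "shoe_size":  [1, 2, 1, 2, 1, 2, 1, 2],                  # uncorrelated
    "survey_q7":  [0, 1, 0, 0, 1, 0, 0, 1],                  # weakly correlated
}
print(rank_features(columns, labels, top_k=1))  # -> ['creatinine']
```

Real feature selection methods (forward selection, L1 regularization, and so on) are more sophisticated than this univariate ranking, but the goal is the same: let the data, rather than manual expert review, surface the predictive risk factors.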

It’s exciting to see machine learning play a more important role in medicine and health management.


[1] T. Lumley, R. A. Kronmal, M. Cushman, T. A. Manolio, and S. Goldstein. A stroke prediction score in the elderly: Validation and web-based application. Journal of Clinical Epidemiology, 55(2):129–136, February 2002.
[2] P. A. Wolf, R. B. D'Agostino, A. J. Belanger, and W. B. Kannel. Probability of stroke: A risk profile from the Framingham Study. Stroke, 22:312–318, March 1991.
[3] A. Khosla, Y. Cao, C. C. Lin, H.-K. Chiu, J. Hu, and H. Lee. An integrated machine learning approach to stroke prediction. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 183–192. ACM, 2010.

Apr 26, 2013

History of machine learning

The development of machine learning is an integral part of the development of artificial intelligence. In the early days of AI, people were interested in building machines that mimic human brains. The perceptron model was invented in 1957, and it generated an overly optimistic view of AI during the 1960s. After Marvin Minsky pointed out the limitations of this model in expressing complex functions, researchers stopped pursuing it for the next decade.

In the 1970s, the machine learning field was dormant while expert systems became the mainstream approach in AI. The revival of machine learning came in the mid-1980s, when the decision tree model was invented and distributed as software. The model can be inspected by a human and is easy to explain. It is also very versatile and can adapt to widely different problems. It was also in the mid-1980s that multi-layer neural networks were invented. With enough hidden units, a neural network can approximate any continuous function, thus overcoming the limitation of the perceptron, and the study of neural networks revived.

Both decision trees and neural networks found wide use in financial applications such as loan approval, fraud detection, and portfolio management. They were also applied to a wide range of industrial processes and to post office automation (address recognition).

Machine learning saw rapid growth in the 1990s, driven by the invention of the World Wide Web and the large amounts of data gathered on the Internet. The fast pace of interaction on the Internet called for more automation and more adaptivity in computer systems. Around 1995, the SVM was proposed and quickly adopted. Packages like LIBSVM and SVMlight made it a popular method to use.

After 2000, logistic regression was rediscovered and redesigned for large-scale machine learning problems. In the ten years following 2003, logistic regression attracted a great deal of research work and became a practical algorithm in many large-scale commercial systems, particularly at large Internet companies.

We have discussed the development of four major machine learning methods. Other methods were developed in parallel but see declining use in the field today: Naive Bayes, Bayesian networks, and the maximum entropy classifier (mostly used in natural language processing).

In addition to these individual methods, we have seen the invention of ensemble learning, where several classifiers are used together, and its wide adoption today.

New machine learning methods are still being invented every day. For the newest developments, check out the annual ICML (International Conference on Machine Learning).

Apr 25, 2013

Yelp and Big data

At the Big Data Gurus meetup yesterday, hosted at the Samsung R&D center in San Jose, Jimmy Retzlaff from Yelp gave a talk on big data at Yelp.

By the end of March 2013, Yelp had 36 million user reviews, covering everything from restaurants to hair salons and other local businesses. The number of reviews on the Yelp website has grown exponentially in the last few years.

Yelp also sees heavy traffic: in January 2013, Yelp had 100 million unique visitors. The website records 2 terabytes of log data and another 2 terabytes of derived logs every day. While this data size is still small compared to eBay or LinkedIn, it calls for big data infrastructure and data mining methods.

Yelp uses MapReduce extensively and builds its infrastructure on the Amazon cloud.

Yelp's log data contains ad displays, user clicks, and so on. Data mining helps Yelp design its search system, show ads, and filter fake reviews. In addition, data mining enables products such as "review highlights" and "people who viewed this also viewed...".
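The MapReduce programming model behind this kind of log processing can be sketched in a few lines of plain Python (Yelp's actual jobs run at far larger scale, on its open-source mrjob framework; the log format below is made up for illustration): a map step emits a key-value pair for each log line, and a reduce step aggregates the values per key.

```python
from collections import defaultdict

# Hypothetical log lines: "<event> <business_id>"
log_lines = [
    "click biz42",
    "ad_display biz42",
    "click biz42",
    "click biz7",
]

def map_step(line):
    """Emit ((event, business), 1) for each log line."""
    event, business = line.split()
    yield (event, business), 1

def reduce_step(pairs):
    """Group pairs by key and sum the counts."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

pairs = [pair for line in log_lines for pair in map_step(line)]
print(reduce_step(pairs)[("click", "biz42")])  # -> 2
```

In a real cluster, the map calls run in parallel across machines and the framework shuffles pairs with the same key to the same reducer; the programmer only writes the two functions above.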

Yelp is one example of companies starting to tackle big data and taking advantage of data mining for creating better services.

Apr 24, 2013

Machine Learning for Anti-virus Software

Symantec is the largest anti-virus software vendor. It has 120 million subscribers, who visit 2 billion websites a day and generate 700 billion submissions. Given such a large amount of data, it is paramount that anti-virus software detect viruses quickly and accurately.

Anti-virus software was originally built manually: security experts reviewed each piece of malware and constructed its "signature", and each file on a computer was checked against these signatures. Given how rapidly malware changes and how many variations exist, there are not enough human experts to generate all the exact signatures. This gave rise to heuristic or generic signatures, which can handle more variations of the same file. However, new types of malware are created every day, so we need a more adaptive approach that identifies malware automatically (without the manual effort of creating signatures). This is where machine learning can help.
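To make the contrast concrete, here is a toy sketch of exact-signature checking (an illustration, not Symantec's implementation): hash each file and look the hash up in a database of known-malware hashes. Its brittleness is exactly why heuristic signatures, and then machine learning, became necessary.

```python
import hashlib

# Hypothetical signature database: SHA-256 hashes of known malware files.
known_malware_hashes = {
    hashlib.sha256(b"evil payload v1").hexdigest(),
}

def is_known_malware(file_bytes):
    """Exact-signature check: flag a file whose hash is in the database.
    Changing a single byte of the file defeats this check."""
    return hashlib.sha256(file_bytes).hexdigest() in known_malware_hashes

print(is_known_malware(b"evil payload v1"))  # -> True
print(is_known_malware(b"evil payload v2"))  # -> False: a tiny variation slips through
```

A one-byte mutation produces a completely different hash, so every variant needs its own entry, which is exactly the scaling problem described above.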

Computer viruses have come a long way. The first virus, "Creeper", appeared in 1971. Then came Rabbit (or Wabbit), and after that computer worms like "Love Letter" and Nimda. Today's computer viruses are much more sophisticated: they evolve much faster and are constantly changing. Virus creation is now funded by organizations and some governments, since there are big incentives to steal users' financial information or companies' trade secrets. In addition, malware enables certain governments to conduct spying or potential cyber war against their targets.

Symantec uses about 500 features in its machine learning model. Feature values can be continuous or discrete. Such features include:
How did the file arrive on this machine (through a browser, email, ...)?
How many other files are on this machine?
How many clean files are on this machine?
Is the file packed or obfuscated? (mutated?)
Does it write to disk or communicate?
How often does it run?
Who runs it?

Researchers at Symantec have experimented with SVM, decision tree, and linear regression models.
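As an illustration of how such a model turns file attributes into a verdict, here is a toy decision stump (a depth-1 decision tree) trained on made-up values of two features from the list above; Symantec's real models, features, and data are of course far richer.

```python
# Toy training set; feature values and labels are invented for illustration.
files = [
    {"is_packed": 1, "runs_per_day": 0.1, "label": "malware"},
    {"is_packed": 1, "runs_per_day": 0.2, "label": "malware"},
    {"is_packed": 0, "runs_per_day": 5.0, "label": "benign"},
    {"is_packed": 0, "runs_per_day": 3.0, "label": "benign"},
]

def train_stump(data, feature):
    """Depth-1 decision tree: split on one binary feature and
    predict the majority label on each side of the split."""
    def majority(rows):
        labels = [r["label"] for r in rows]
        return max(set(labels), key=labels.count)
    left = [r for r in data if r[feature] == 0]
    right = [r for r in data if r[feature] == 1]
    return {"feature": feature, 0: majority(left), 1: majority(right)}

def predict(stump, file_features):
    return stump[file_features[stump["feature"]]]

stump = train_stump(files, "is_packed")
print(predict(stump, {"is_packed": 1, "runs_per_day": 0.4}))  # -> malware
```

A full decision tree repeats this splitting recursively, choosing at each node the feature that best separates malware from benign files.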

In building a classifier, they are not simply optimizing accuracy or the true positive rate. They are also concerned with false positives, where benign software is classified as malware. Such false positive predictions can be very costly for users. Balancing the true positive rate against the false positive rate leads to using the ROC (Receiver Operating Characteristic) curve.

An ROC curve plots the trade-off between the true positive rate and the false positive rate. Each point on the curve corresponds to a cutoff we choose, and Symantec uses the ROC curve to select a target operating point. Below is an illustration of the tradeoff.

The chart above suggests that when we aim for a 90% true positive rate, we incur a 20% false positive rate; if we aim for only an 80% true positive rate, the false positive rate is reduced. (A better classifier shifts the ROC curve up, so that we achieve a higher true positive rate at any given false positive rate.)
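The points of an ROC curve can be computed directly from classifier scores by sweeping the cutoff and recording the (false positive rate, true positive rate) pair at each value. A small sketch with made-up scores:

```python
def roc_points(scores, labels):
    """For each candidate cutoff, compute (FPR, TPR).
    scores: classifier confidence that a file is malware.
    labels: 1 for malware, 0 for benign."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for cutoff in sorted(set(scores), reverse=True):
        predicted = [s >= cutoff for s in scores]
        tp = sum(p and l for p, l in zip(predicted, labels))
        fp = sum(p and not l for p, l in zip(predicted, labels))
        points.append((fp / negatives, tp / positives))
    return points

scores = [0.9, 0.8, 0.7, 0.3, 0.2]
labels = [1,   1,   0,   1,   0]   # made-up ground truth
print(roc_points(scores, labels))  # curve runs from (0.0, 0.33) up to (1.0, 1.0)
```

Plotting these pairs gives the ROC curve; picking an operating point, as described above, amounts to picking one cutoff along this sweep.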

According to their researchers, Symantec has achieved a high accuracy rate (the average of the true positive and true negative rates) of 95%. Its true positive rate is above 98% and its false positive rate is below 1%.

I am a user of Norton software (by Symantec) and enjoy it. I hope to see more success from Symantec, and that we win the war against malware!