Symantec is the largest anti-virus software vendor. It has 120 million subscribers, who visit 2 billion websites a day and generate 700 billion submissions. Given such a large volume of data, it is paramount that anti-virus software detect viruses quickly and accurately.
Anti-virus software was originally built by hand. Security experts reviewed each piece of malware and constructed its “signature”, and each computer file was checked against these signatures. Given how rapidly malware changes and how many variations exist, there are not enough human experts to generate all the exact signatures. This gave rise to heuristic or generic signatures, which can handle more variations of the same file. However, new types of malware are created every day, so we need a more adaptive approach that identifies malware automatically, without the manual effort of creating signatures. This is where machine learning can help.
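To see why exact signatures struggle with variations, here is a minimal sketch in Python. It assumes a signature is simply the SHA-256 digest of a known-bad file, which is a simplification (real signatures can be richer than a hash), and the sample payloads and the helper name `matches_signature` are made up for illustration.

```python
import hashlib

# Hypothetical signature database: SHA-256 digests of known-malicious files.
# The payloads below are invented for illustration only.
KNOWN_BAD_PAYLOADS = [b"fake malware sample A", b"fake malware sample B"]
SIGNATURES = {hashlib.sha256(p).hexdigest() for p in KNOWN_BAD_PAYLOADS}


def matches_signature(data: bytes) -> bool:
    """Return True if the digest of the file content is a known signature."""
    return hashlib.sha256(data).hexdigest() in SIGNATURES


print(matches_signature(b"fake malware sample A"))   # exact copy of a known sample
print(matches_signature(b"fake malware sample A!"))  # same sample with one byte added
```

The exact copy matches, but the variant with a single extra byte does not, which is the gap that heuristic signatures and machine learning try to close.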
Computer viruses have come a long way. The first virus, “Creeper”, appeared in 1971. Then came Rabbit (or Wabbit), followed by computer worms like “Love Letter” and Nimda. Today’s viruses are much more sophisticated; they evolve much faster and change constantly. Virus creation is now funded by organizations and some governments. There is a big incentive to steal users’ financial information or companies’ trade secrets. In addition, malware enables certain governments to spy on their targets or potentially wage cyber war against them.
Symantec uses about 500 features in its machine learning model. Feature values can be continuous or discrete. The features include:
• How did the file arrive on this machine (through a browser, email, ...)?
• How many other files are on this machine?
• How many clean files are on this machine?
• Is the file packed or obfuscated (mutated)?
• Does it write or communicate?
• How often does it run?
• Who runs it?
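As a rough sketch of how a mix of discrete and continuous features like these could be turned into numbers for a model, the Python below one-hot encodes the discrete values and passes the counts through. The class `FileObservation`, its field names, and the channel list are hypothetical, chosen only to mirror a few of the questions above; they are not Symantec's actual features.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical arrival channels; the list above only names browser and email.
CHANNELS = ["browser", "email", "other"]


@dataclass
class FileObservation:
    # Illustrative fields only, not Symantec's real feature set.
    arrival_channel: str   # discrete
    other_files: int       # continuous (count)
    clean_files: int       # continuous (count)
    is_packed: bool        # discrete
    runs_per_day: float    # continuous


def to_vector(obs: FileObservation) -> List[float]:
    """One-hot encode the discrete fields and pass numeric ones through."""
    channel = obs.arrival_channel if obs.arrival_channel in CHANNELS else "other"
    one_hot = [1.0 if channel == c else 0.0 for c in CHANNELS]
    return one_hot + [
        float(obs.other_files),
        float(obs.clean_files),
        1.0 if obs.is_packed else 0.0,
        float(obs.runs_per_day),
    ]


obs = FileObservation("email", other_files=1200, clean_files=1150,
                      is_packed=True, runs_per_day=0.5)
print(to_vector(obs))
```

In practice the counts would probably also be scaled before training, since models such as SVMs are sensitive to feature magnitude.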
Researchers at Symantec experiment with SVM, decision tree and linear regression models.
In building a classifier, they are not simply optimizing accuracy or the true positive rate. They are also concerned about false positives, where benign software is classified as malware. Such false positive predictions can carry a high cost for users. Balancing true positives against false positives leads to the ROC (Receiver Operating Characteristic) curve.
An ROC curve plots the trade-off between the true positive rate and the false positive rate. Each point on the curve corresponds to a cutoff we choose. They use the ROC curve to select a target point. Below is an illustration of the trade-off.
The chart above suggests that when we aim for a 90% true positive rate, we will have a 20% false positive rate. However, when we aim for only an 80% true positive rate, the false positive rate can be reduced below that. (A better classifier could shift the ROC curve up, so that we achieve a higher true positive rate for any given false positive rate.)
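The cutoff idea can be sketched in a few lines of Python. The scores and labels below are made up (they are not Symantec data), and the helper names `roc_points` and `pick_cutoff` are my own; the sketch assumes a file is flagged as malware when its score is at or above the cutoff, and that the data contains at least one malware and one benign file.

```python
def roc_points(scores, labels):
    """Return (cutoff, fpr, tpr) for each distinct score used as a cutoff.

    A file is predicted malicious when its score >= cutoff.
    labels: 1 = malware, 0 = benign.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 0)
        points.append((cutoff, fp / negatives, tp / positives))
    return points


def pick_cutoff(points, max_fpr):
    """Choose the cutoff with the highest TPR whose FPR stays within max_fpr."""
    allowed = [p for p in points if p[1] <= max_fpr]
    return max(allowed, key=lambda p: p[2]) if allowed else None


# Made-up classifier scores for ten files, with their true labels.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10]
labels = [1,    1,    0,    1,    1,    0,    1,    0,    0,    0]

for cutoff, fpr, tpr in roc_points(scores, labels):
    print(f"cutoff={cutoff:.2f}  FPR={fpr:.1f}  TPR={tpr:.1f}")

# Best cutoff if we tolerate at most a 20% false positive rate.
print(pick_cutoff(roc_points(scores, labels), max_fpr=0.2))
```

Lowering the cutoff moves us along the curve: we catch more malware, but we also flag more benign files.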
According to their researcher, Symantec has achieved a high accuracy rate (the average of the true positive rate and the true negative rate) of 95%. Its true positive rate is above 98% and its false positive rate is below 1%.
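For readers who want the arithmetic behind these rates, here is a small Python sketch that derives them from confusion-matrix counts. The counts are invented for illustration and do not reproduce Symantec's results; `summarize` is a hypothetical helper name.

```python
def summarize(tp, fn, tn, fp):
    """Compute TPR, FPR, and the average of TPR and TNR from raw counts."""
    tpr = tp / (tp + fn)          # share of malware that was caught
    tnr = tn / (tn + fp)          # share of benign files left alone
    fpr = fp / (fp + tn)          # share of benign files wrongly flagged
    return tpr, fpr, (tpr + tnr) / 2


# Made-up counts: 100 malware files and 1000 benign files.
tpr, fpr, avg = summarize(tp=75, fn=25, tn=875, fp=125)
print(f"TPR={tpr:.3f}  FPR={fpr:.3f}  average of TPR and TNR={avg:.4f}")
```

Averaging the two rates keeps the much larger pool of benign files from dominating the score, which matters when malware is rare relative to clean software.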
I am a user of Norton software (by Symantec) and enjoy it. I hope to see more success from Symantec, and I hope we are winning the war against malware!