Feb 21, 2013

Data Mining and Neuroscience

Bradley Voytek from the University of California, San Francisco gave a talk on data mining and neuroscience yesterday in Mountain View, as part of the Big Data Think Tank meetup.

Voytek said that data mining could play an important role in brain study and treatment. Imagine applying data mining to reduce open-skull surgery from 45 minutes to 2 minutes, or to analyze and understand fMRI data better. Imagine applying data mining to understand aging and brain response, helping us identify ways to improve cognitive function in the elderly. In addition, though not mentioned in this talk, there are recent studies applying data mining (specifically association rule mining) to understand Alzheimer’s disease by identifying associations in brain region changes.

Voytek also mentioned the need to understand the network of neural signals. This is an exciting domain, as it will ultimately help us improve human cognition, such as hearing and vision. As reported recently in the news, the implementation of “bionic eyes” depends on mapping the neurons responsible for visual processing. A deeper understanding of these neurons could help us eradicate blindness completely. Imagine a future with no more blindness or deafness. How much human suffering could be eliminated!

Neuroscience is the next frontier of science. Understanding the human brain, and ultimately human consciousness, will address age-old questions about the self and the soul. With that understanding, imagine being able to preserve consciousness or even human memory. It is not far-fetched to think we could watch another person’s memory like a movie, if we truly understood the neural signals associated with memory retrieval.

It is exciting to see that data mining can play an important role in the advancement of science, particularly in neuroscience.

Feb 11, 2013

Data Mining vs. Machine Learning

Data mining and machine learning used to be two cousins; they have different parents. Now they have grown increasingly alike, almost like twins. People often even refer to data mining as machine learning.

The field of machine learning grew out of the effort to build artificial intelligence. Its major concern is making a machine learn from and adapt to new information. The origin of machine learning can be traced back to 1957, when the perceptron model was invented; it was modeled after neurons in the human brain. That prompted the development of neural network models, which flourished in the late 1980s. From the 1980s to the 1990s, the decision tree method became very popular, owing to the efficient C4.5 package. The SVM was invented in the mid-1990s and has since been widely used in industry. Logistic regression, an old method from statistics, saw growing adoption in machine learning after 2001, when the book The Elements of Statistical Learning was published.

The field of data mining grew out of knowledge discovery in databases. In 1993, a seminal paper by Rakesh Agrawal and two coauthors proposed an efficient algorithm for mining association rules in large databases. This paper prompted many research papers on discovering frequent patterns and on more efficient mining algorithms. The early work on data mining in the 1990s was linked to crafting better SQL statements and working directly with databases.
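To make the idea of association rule mining concrete, here is a minimal sketch of its first step, counting frequent item pairs in a transaction database, in the spirit of the Apriori family of algorithms. The transactions and the support threshold are invented for illustration, not taken from the paper.

```python
from itertools import combinations

# Toy transaction database: each row is one "market basket" of items.
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "bread", "butter"},
    {"milk"},
]

def frequent_pairs(transactions, min_support=0.4):
    """Count item pairs and keep those whose support meets the threshold."""
    n = len(transactions)
    counts = {}
    for t in transactions:
        for pair in combinations(sorted(t), 2):
            counts[pair] = counts.get(pair, 0) + 1
    # Support = fraction of transactions containing the pair.
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}

print(frequent_pairs(transactions))
# e.g. ("bread", "milk") appears in 3 of 5 baskets, so its support is 0.6
```

A real miner would extend frequent pairs to larger itemsets and then derive rules such as "bread → milk" with a confidence score, but the counting-and-thresholding core is the same.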

Data mining has a strong focus on working with industrial problems and getting practical solutions. Therefore it concerns itself not only with data size (large data), but also with data processing speed (streaming data). In addition, personalized recommender systems and network mining were developed out of business needs, outside the machine learning field.

The two major conferences for data mining are KDD (Knowledge Discovery and Data Mining) and ICDM (International Conference on Data Mining). The two major conferences for machine learning are ICML (International Conference on Machine Learning) and NIPS (Neural Information Processing Systems). Machine learning researchers attend both types of conferences. However, the data mining conferences have a much stronger industrial link.

Data miners typically have a strong foundation in machine learning, but also have a keen interest in applying it to large-scale problems.

Over time, we will see deeper connections between data mining and machine learning. Could they become twins one day? Only time will tell.

Feb 10, 2013

The future of genome data mining

23andMe is a startup based in Mountain View, California. Founded in 2006, its core business is genome sequencing for individuals, along with additional information on your ancestry and possible disease risks, which you can access on their website.

The cost of sequencing a person’s genome used to be prohibitive. However, 23andMe, with its deep pockets of venture and personal funding (co-founder Anne Wojcicki is the wife of Google co-founder Sergey Brin), was able to cut the sequencing price from $999 in 2007 to $399 in 2008, then to $299 through the end of 2012. In December 2012, with $50 million in Series D funding, 23andMe slashed the price to $99 per person. Such a price is probably below the actual testing cost. Why the price cut? 23andMe states that its goal is to get 1 million people to participate.

What is the drive behind this large expansion of the user base? The first potential is disease discovery. With a large population, a disease can be more solidly linked to genome data. Suppose we find a gene mutation in one diabetes patient; that is not enough to conclude that the mutation caused her diabetes. However, if we find the same gene mutation in 1,000 diabetes patients, we can be more confident in drawing this conclusion. Ultimately it is about getting a large enough population sample that we can uniquely link a segment of gene mutations, or ancestral traits, to a disease.

By December 2012 (before the price cut), 23andMe had accumulated 180,000 individual genome profiles [1]. So far, this is the largest dataset any single organization has accumulated on human genomes. Combined with the self-reported health profiles of these customers, studies linking diseases to gene patterns can be done more conclusively.

23andMe has partnered with Genentech to study a range of conditions from Alzheimer’s to breast cancer, and (most recently) response to the cancer drug Avastin. In addition, the company received a small grant from NIH to study allergies and asthma. Given the large population of genome data, we could see some exciting discoveries.

Data mining will play a big role in these new discoveries. Not only does data mining enable pattern discovery in large datasets spanning many different diseases and personal traits, it can also create predictive models of disease onset from a person’s genome profile. Feature selection techniques from data mining have also worked well in genome studies, where there are more than 20,000 gene features but only a few data points. Even with 1 million people in the dataset, the problem of having few data points could still exist when only a small group of people shares a similar disease (thus it is important to get data from even more people, ideally tens of millions or even billions).
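As a toy illustration of feature selection in this many-features, few-samples setting, here is a minimal univariate filter: score each "gene" feature by how strongly it separates cases from controls, then keep the top-k features. The data, the scoring rule (absolute difference of class means, a crude stand-in for a t-statistic), and the dimensions are all invented for illustration.

```python
def feature_scores(X, y):
    """Score each feature by the absolute difference of class means."""
    n_features = len(X[0])
    scores = []
    for j in range(n_features):
        cases = [row[j] for row, label in zip(X, y) if label == 1]
        controls = [row[j] for row, label in zip(X, y) if label == 0]
        scores.append(abs(sum(cases) / len(cases) - sum(controls) / len(controls)))
    return scores

def select_top_k(X, y, k):
    """Return indices of the k highest-scoring features."""
    scores = feature_scores(X, y)
    return sorted(range(len(scores)), key=lambda j: -scores[j])[:k]

# 6 samples x 4 hypothetical gene features; feature 2 separates the classes.
X = [
    [0.1, 0.5, 2.0, 0.3],
    [0.2, 0.4, 2.1, 0.2],
    [0.1, 0.6, 1.9, 0.4],
    [0.2, 0.5, 0.1, 0.3],
    [0.1, 0.4, 0.2, 0.2],
    [0.3, 0.6, 0.0, 0.4],
]
y = [1, 1, 1, 0, 0, 0]  # 1 = case, 0 = control
print(select_top_k(X, y, 1))  # → [2]
```

In a real genome study the same filtering idea would run over 20,000+ features with a statistically sound score and multiple-testing correction; the point here is only the shape of the computation.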

The future of genome study is closely linked to data mining. This is an exciting time to be a data miner.

[1] 23andMe press release, “23andMe Raises More Than $50 Million in New Financing”, December 11, 2012.

Feb 9, 2013

Graph Mining

With the rapid rise of social networks (through Facebook), professional networks (through LinkedIn), and short-form news networks (through Twitter), computer scientists have taken a keen look at network mining. In the terminology of computer science, a network is a graph, defined as a collection of nodes and edges. Therefore network mining is also called graph mining.

The most successful application of graph mining is web search, where the Internet is modeled as a network and webpages are nodes. Each webpage is ranked based on the strength of its links.

Other applications of graph mining include:
1. Understanding molecular structure for drug discovery [1]

2. Predicting the spread of infectious diseases.

Any social network can be modeled as a graph, where nodes are people and edges are their relationships. On Facebook, the edges are “friend”. On Twitter, the edges become “followed by”. On LinkedIn, the edges are “connection”. On eBay, the edges can be “sold to”.
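In code, such a graph is most simply stored as an adjacency list: a map from each person to the people their edges point at. A minimal sketch, with hypothetical account names:

```python
# Directed "follows" graph: follows["alice"] is the set of accounts
# that alice follows (all names are hypothetical).
follows = {
    "alice": {"bob", "carol"},
    "bob": {"carol"},
    "carol": {"alice"},
    "dave": {"carol"},
}

def followers(graph, node):
    """Invert the edge direction: who points at `node`?"""
    return {u for u, outs in graph.items() if node in outs}

print(followers(follows, "carol"))  # → {"alice", "bob", "dave"}
```

An undirected relationship like Facebook's "friend" would simply store each edge in both directions.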

How do we make use of the network structure? For marketers, finding the top influencers in a social network can be very useful. These people influence many others with their opinions. Marketing can be much more effective when focused on these influencers.

How do we discover influencers? On Facebook, these are people who have many friends and whose posts get many comments. On Twitter, they have many followers and their tweets are retweeted often. Note that it is possible for a top influencer to have a small number of friends (or followers), as long as those friends (or followers) are themselves top influencers. Such a person could be a “king maker”, who directly influences the most powerful or influential politicians.

Mining top influencers therefore calls for an algorithm like PageRank, which is successfully used to rank web pages. The essence of the PageRank algorithm is the recursive calculation of each node’s weight from the weights of the nodes linking to it. This applies directly to finding top influencers, whose influence strength can be computed recursively from their followers’ influence levels.
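The recursion described above can be sketched as a plain power iteration. This is a textbook-style PageRank, not any particular production implementation; the graph and damping factor are illustrative.

```python
def pagerank(graph, damping=0.85, iters=50):
    """Power iteration; graph maps each node to its list of out-links."""
    nodes = list(graph)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}        # start with uniform rank
    for _ in range(iters):
        new = {u: (1 - damping) / n for u in nodes}
        for u, outs in graph.items():
            if not outs:                       # dangling node: spread evenly
                for v in nodes:
                    new[v] += damping * rank[u] / n
            else:                              # pass rank along out-links
                for v in outs:
                    new[v] += damping * rank[u] / len(outs)
        rank = new
    return rank

# An edge u -> v means "u follows v", so rank flows toward the followed.
graph = {"a": ["b"], "b": ["c"], "c": ["b"], "d": ["b"]}
ranks = pagerank(graph)
print(max(ranks, key=ranks.get))  # → "b", the most-endorsed node
```

Note that "b" wins even though it follows only one account: rank depends on who points at you, which is exactly the "king maker" effect described above.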

With 100 million people in a network, mining top influencers is a computational challenge. Fortunately, such computation can be parallelized, as each node can be calculated simultaneously. This is why Google invented MapReduce, and how Hadoop came to be. Essentially, the so-called “Big Data” movement is about providing parallel computing infrastructure (such as Hadoop). Graph mining pioneered big data computing.

[1] Takigawa, Ichigaku, and Hiroshi Mamitsuka. “Graph mining: procedure, application to drug discovery and recent advances.” Drug Discovery Today, Volume 18, Issues 1–2, January 2013, Pages 50–57.