Pages

Jul 8, 2013

Text Mining: Named Entity Detection

An interesting task of text mining is detecting entities in the text. Such entities could be a person, a company, a product, or a location. Since an entity is associated with a special name, it is also called Named Entity. For example, the following text contains 3 named entities:
       Apple has hired Paul Deneve as vice president, reporting to CEO Tim Cook.
The first term “Apple” indicates a company, and the second and third are persons.

Named entity detection (NER) is an important component in  social media analysis. It helps us to understand user sentiment on specific products. NER is also important for product search for E-commerce companies. It helps us to understand user search query related to certain products.

To map each name to an entity, one solution is using a dictionary of special names. Unfortunately, this approach has two serious problems. The first problem is that our dictionary is not complete. New companies are created and new products are sold every day. It is hard to keep track all the new names. The second problem is the ambiguity of associating a name to an entity. The following example illustrates this:
As Washington politicians argue about the budget reform, it is a good time to look back at George Washington’s time.
In this text, the first mention of “Washington” refers to a city, while the second mention refers to a person. The distinction of these two entities comes from their context.

To resolve ambiguity in entity mapping,  we can create certain rules to utilize the context. For example, we can create the following rules:
  1. When ‘Washington’ is followed by ‘politician’, then it refers to a city.
  2. When ‘Washington’ is preceded by ‘in’, then it refers to a city.
  3. When ‘Washington’ is preceded by ‘George’, then it refers to a person.
But such rules could be too many. For example, each of the following phrases would generate a different rule: “Washington mentality”, “Washington atmosphere”, “Washington debate” as well as “Washington biography” and “Washington example”. The richness of natural language makes the number of rules exploding and still susceptible to exceptions.

Instead of manually creating rules, we can apply machine learning. The advantage of machine learning is that it creates patterns automatically from examples. No rule needs to be manually written by humans. The machine learning algorithm takes a set of training examples, and chunk out its own model (that is comparable to rules). If we get new training data, we can re-train the machine learning algorithm and generate a new model quickly.

How does the machine learning approach work? I will discuss it in the next post. 

3 comments:

  1. Thank you for this post. Very interesting.

    ReplyDelete
  2. I think it is amazing that the search engine will try to interpret and answer my question. It sounds like program that was playing jeopardy. What was it called? Watson?
    As the technology grows it will reduce the really odd search results that pop up. To see more info please visit http://essayhogwarts.com/dissertation-proposal/.

    ReplyDelete
  3. Interesting post. I will wait for your next post on machine learning.

    ReplyDelete