Jul 9, 2013

Text Mining: Named Entity Detection (2)

In the machine learning approach for detecting named entities, we first create labeled data, which are sentences with each word tagged. The tags correspond to the entities of interest. For example,
May      Smith    was   born   in   May.
Person   Person                     Date
Since an entity may span several words, we can use finer tags in IOB (inside-outside-beginning) format, which indicates whether a word is at the beginning (B) of a phrase, inside (I) it, or outside (O) any phrase. We will have the tags B-Person, I-Person, B-Date, I-Date and O (other). With such tagging, the above example becomes
May        Smith      was   born   in   May.
B-Person   I-Person   O     O      O    B-Date
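
To make the IOB scheme concrete, here is a minimal Python sketch that turns entity spans into IOB tags. The function name `to_iob` and the span format (start index, end index, label) are our own choices for illustration, and we assume the sentence is already split into tokens with the final period stripped.

```python
def to_iob(tokens, spans):
    """Convert entity spans into one IOB tag per token.

    spans is a list of (start, end, label) tuples over token indices,
    with `end` exclusive. This span format is an assumption made for
    this example.
    """
    tags = ["O"] * len(tokens)          # every word starts as "other"
    for start, end, label in spans:
        tags[start] = "B-" + label      # first word of the entity
        for i in range(start + 1, end):
            tags[i] = "I-" + label      # remaining words of the entity
    return tags


tokens = ["May", "Smith", "was", "born", "in", "May"]
spans = [(0, 2, "Person"), (5, 6, "Date")]

for token, tag in zip(tokens, to_iob(tokens, spans)):
    print(token, tag)
```

The two-word name gets B-Person followed by I-Person, while the one-word date only needs B-Date.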

Our training data have every word tagged. Our goal is to learn this mapping and apply it to a new sentence. In other words, we want to find a mapping from a word (given its context) to an entity label.

Each word can be represented as a set of features. A machine learning model maps input features to an output (entity class). Such features could include:
Word identity (the word itself)
Word position from the beginning
Word position from the end
The word before
The word after
Whether the word is capitalized
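
The features listed above can be computed with a few lines of Python. This is only a sketch: the function name `word_features`, the dictionary keys, and the `<START>`/`<END>` placeholders for missing neighbors are assumptions made for illustration.

```python
def word_features(tokens, i):
    """Return the features of the word at index i as a dictionary."""
    word = tokens[i]
    return {
        "word": word,                                   # word identity
        "pos_from_start": i,                            # position from the beginning
        "pos_from_end": len(tokens) - 1 - i,            # position from the end
        "prev_word": tokens[i - 1] if i > 0 else "<START>",
        "next_word": tokens[i + 1] if i < len(tokens) - 1 else "<END>",
        "is_capitalized": word[:1].isupper(),
    }


tokens = ["May", "Smith", "was", "born", "in", "May"]
rows = [word_features(tokens, i) for i in range(len(tokens))]
for row in rows:
    print(row)
```

Note that the two occurrences of "May" share the same word identity and capitalization. Only the position and neighbor features tell them apart, which is why context features matter.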

Our training data would look like the following, where each row is a data point.
Word identity | Position from the beginning | Position from the end | Word before | Word after | Is capitalized?

Once we have created training data like the above table, we can train a standard machine learning method such as a support vector machine (SVM), a decision tree or logistic regression to generate a model (which contains machine-generated rules). We can then apply this model to any new sentence to detect the "Person" and "Date" entities in it.
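
To illustrate the train-then-apply cycle without any external library, the sketch below uses a plain multi-class perceptron. This is a stand-in chosen only because it fits in a few lines of standard-library Python; it is not one of the methods named above, and in practice we would reach for an SVM or logistic regression from an established package. All function names are hypothetical, the position features are simplified to "is first word" and "is last word", and the second sentence is made up for the demonstration.

```python
from collections import defaultdict


def featurize(tokens, i):
    """Describe the word at index i as a list of indicator features."""
    word = tokens[i]
    prev_word = tokens[i - 1] if i > 0 else "<START>"
    next_word = tokens[i + 1] if i < len(tokens) - 1 else "<END>"
    return [
        "bias",
        "word=" + word.lower(),
        "prev=" + prev_word.lower(),
        "next=" + next_word.lower(),
        "cap=" + str(word[:1].isupper()),
        "first=" + str(i == 0),                 # simplified position features
        "last=" + str(i == len(tokens) - 1),
    ]


def predict(weights, labels, feats):
    """Pick the label whose summed feature weights are highest."""
    return max(labels, key=lambda label: sum(weights[(f, label)] for f in feats))


def train(sentences, epochs=10):
    """Perceptron training: adjust weights only when a prediction is wrong."""
    labels = sorted({tag for _, tags in sentences for tag in tags})
    weights = defaultdict(float)
    for _ in range(epochs):
        for tokens, tags in sentences:
            for i, gold in enumerate(tags):
                feats = featurize(tokens, i)
                guess = predict(weights, labels, feats)
                if guess != gold:
                    for f in feats:
                        weights[(f, gold)] += 1.0   # reward the correct tag
                        weights[(f, guess)] -= 1.0  # penalize the wrong tag
    return weights, labels


def tag(weights, labels, tokens):
    """Apply the trained model to every word of a sentence."""
    return [predict(weights, labels, featurize(tokens, i)) for i in range(len(tokens))]


training_data = [
    (["May", "Smith", "was", "born", "in", "May"],
     ["B-Person", "I-Person", "O", "O", "O", "B-Date"]),
]
weights, labels = train(training_data)

new_sentence = ["June", "Jones", "was", "born", "in", "June"]  # made-up example
for token, label in zip(new_sentence, tag(weights, labels, new_sentence)):
    print(token, label)
```

With a single training sentence the model mostly memorizes context patterns such as "the word after 'in' at the end of a sentence". A useful model needs many labeled sentences covering varied contexts.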

Note that we can only detect the entity types contained in our training data. While this may seem like a limitation, the model can still discover new instances or values of those existing entity types.

The approach we have discussed so far is a supervised learning approach, which depends heavily on human-labeled data. When manual labels are hard to come by, we can find ways to create training data by machine. Such an approach is called semi-supervised learning, and it holds a lot of promise as data sets grow large.

