Jul 9, 2013

Text Mining: Named Entity Detection (2)

In the machine learning approach for detecting named entities, we first create labeled data, which are sentences with each word tagged. The tags correspond to the entities of interest. For example,
May     Smith   was   born   in   May.
Person  Person                    Date
Since an entity may span several words, we can use finer tags in IOB (inside-outside-beginning) format, which indicates whether a word is at the beginning (B) of an entity phrase, inside (I) one, or outside (O) any entity. We will have the tags B-Person, I-Person, B-Date, I-Date and O. With such tagging, the above example becomes
May       Smith     was   born   in   May.
B-Person  I-Person  O     O      O    B-Date
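
The tagging scheme above can be sketched in a few lines of Python. This is only an illustration: the function name `spans_to_iob` and the `(start, end, label)` span format (token indices, end exclusive) are assumptions made for this example, and the trailing period is dropped from the last token for simplicity.

```python
def spans_to_iob(tokens, spans):
    """Convert entity spans into one IOB tag per token.

    spans is a list of (start, end, label) tuples in token indices,
    with end exclusive. This span format is an assumption of the sketch.
    """
    tags = ["O"] * len(tokens)          # every word starts as "outside"
    for start, end, label in spans:
        tags[start] = f"B-{label}"      # first word of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"      # remaining words of the entity
    return tags


tokens = ["May", "Smith", "was", "born", "in", "May"]
spans = [(0, 2, "Person"), (5, 6, "Date")]
tags = spans_to_iob(tokens, spans)

for token, tag in zip(tokens, tags):
    print(f"{token:<8}{tag}")
```

Note that the two occurrences of "May" receive different tags, which is why the word alone is not enough and the context has to be part of the input.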

Our training data have every word tagged. Our goal is to learn this mapping and apply it to a new sentence. In other words, we want to find a mapping from a word (given its context) to an entity label.

Each word can be represented as a set of features. A machine learning model maps input features to an output (entity class). Such features could include:
Word identity (the word itself)
Word position from the beginning
Word position from the end
The word before
The word after
Is the word capitalized?
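
As a rough sketch, the features listed above could be computed per word as follows. The function name `word_features`, the dictionary keys, and the use of "/" for a missing neighbor are choices made for this example, not a fixed convention.

```python
def word_features(tokens, i):
    """Return the features of the word at index i as a dictionary."""
    word = tokens[i]
    last = len(tokens) - 1
    return {
        "word": word,                                      # word identity
        "pos_from_start": i,                               # 0 for the first word
        "pos_from_end": last - i,                          # 0 for the last word
        "prev_word": tokens[i - 1] if i > 0 else "/",      # "/" means no word before
        "next_word": tokens[i + 1] if i < last else "/",   # "/" means no word after
        "is_capitalized": word[:1].isupper(),
    }


tokens = ["May", "Smith", "was", "born", "in", "May"]
rows = [word_features(tokens, i) for i in range(len(tokens))]
for row in rows:
    print(row)
```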

Our training data would look like the following, where each row is a data point.
Word identity   Position from the beginning   Position from the end   Word before   Word after   Is capitalized?   Class
‘May’           0                             5                       /             ‘Smith’      yes               B-Person
‘Smith’         1                             4                       ‘May’         ‘was’        yes               I-Person
‘May’           5                             0                       ‘in’          /            yes               B-Date

Once we have training data like the table above, we can train a standard machine learning method such as an SVM, a decision tree, or logistic regression to produce a model (a set of machine-generated rules). We can then apply this model to any new sentence to detect the entity types "Person" and "Date".
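
To make the train-then-apply step concrete without relying on any library, here is a minimal sketch that uses a simple multiclass perceptron as a stand-in for the SVM, decision tree, or logistic regression mentioned above. All function names are made up for this example, and the one-sentence training set is only there to show the mechanics.

```python
from collections import defaultdict


def featurize(tokens, i):
    """Encode one word and its context as a set of indicator features."""
    last = len(tokens) - 1
    return {
        f"word={tokens[i]}",
        f"pos_from_start={i}",
        f"pos_from_end={last - i}",
        f"prev={tokens[i - 1] if i > 0 else '/'}",
        f"next={tokens[i + 1] if i < last else '/'}",
        f"cap={tokens[i][:1].isupper()}",
    }


def score(weights, label, feats):
    """Sum the weights of the active features for one label."""
    return sum(weights[label].get(f, 0.0) for f in feats)


def predict_one(weights, feats):
    # sorted() keeps tie-breaking deterministic
    return max(sorted(weights), key=lambda label: score(weights, label, feats))


def train_perceptron(sentences, epochs=10):
    """sentences is a list of (tokens, tags) pairs; returns per-label weights."""
    labels = {tag for _, tags in sentences for tag in tags}
    weights = {label: defaultdict(float) for label in labels}
    for _ in range(epochs):
        for tokens, tags in sentences:
            for i, gold in enumerate(tags):
                feats = featurize(tokens, i)
                guess = predict_one(weights, feats)
                if guess != gold:
                    # Perceptron update: reward the gold label, penalize the guess
                    for f in feats:
                        weights[gold][f] += 1.0
                        weights[guess][f] -= 1.0
    return weights


def tag_sentence(weights, tokens):
    """Predict one tag per word of a sentence."""
    return [predict_one(weights, featurize(tokens, i)) for i in range(len(tokens))]


training = [
    (["May", "Smith", "was", "born", "in", "May"],
     ["B-Person", "I-Person", "O", "O", "O", "B-Date"]),
]
weights = train_perceptron(training)
print(tag_sentence(weights, training[0][0]))
```

With a single training sentence the model can do little more than memorize it. In practice the training set would contain many tagged sentences, and the perceptron could be swapped for any standard classifier that accepts the same features.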

Note that we can only detect entity types contained in our training data. While this may seem a limitation, it allows us to discover new instances or values associated with existing entity types.
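
Once a model has tagged a new sentence, the entity instances it found can be read off by grouping consecutive B- and I- tags. The sketch below shows one way to do this. The function name `iob_to_entities` is hypothetical, and the tags in the usage example are written by hand to stand in for model output.

```python
def iob_to_entities(tokens, tags):
    """Group IOB-tagged words into (phrase, entity_type) pairs."""
    entities = []
    words, label = [], None

    def flush():
        # Close the entity currently being built, if any
        if words:
            entities.append((" ".join(words), label))

    for token, tag in zip(tokens, tags):
        prefix, _, tag_label = tag.partition("-")
        if prefix == "B" or (prefix == "I" and tag_label != label):
            # A B- tag (or an I- tag of a different type) starts a new entity
            flush()
            words, label = [token], tag_label
        elif prefix == "I":
            words.append(token)          # continue the current entity
        else:
            flush()                      # "O" ends any open entity
            words, label = [], None
    flush()
    return entities


# Hand-written tags standing in for the output of a trained model
tokens = ["John", "Brown", "moved", "in", "June"]
tags = ["B-Person", "I-Person", "O", "O", "B-Date"]
found = iob_to_entities(tokens, tags)
print(found)
```

Here "John Brown" and "June" never appeared in the earlier example, yet they surface as new values of the existing types Person and Date.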

The approach we have discussed so far is supervised learning, which depends heavily on human-labeled data. When manual labels are hard to come by, we can find ways to create training data by machine. Such an approach is called semi-supervised learning, and it holds a lot of promise as data sets grow large.
