Pages

Jul 9, 2013

Text Mining: Name Entity Detection (2)

In the machine learning approach for detecting named entities, we first create labeled data, which are sentences with each word tagged. The tags correspond to the entities of interest. For example,
May      Smith   was  born   in   May.
        Person  Person                        Date
Since an entity may have several words, we can use finer tags in IOB (in-out-begin) format, which indicates whether the word is the beginning (B) or inside (I) of a phrase. We will have tags for B-Person, I-Person, B-Date, I-Date and O (other).  With such tagging, the above example becomes
May          Smith      was  born  in    May.
B-Person  I-Person   O     O    O   B-Date

Our training data have every word tagged. Our goal is learning about this mapping and apply it a new sentence. In other words, we want to find a mapping from a word (given its context) to an entity label.

Each word can be represented as a set of features. A machine learning model maps input features to an output (entity class). Such features could include:
Word identity (the word itself)
Word position from the beginning.
Word position from the end.
the word before
the word after
Is the word capitalized?

Our training data would look like the following, where each row is a data point.
Word identity
position from the beginning
position to the end
word before
word after
Is capitalized?
Class
‘May’
0
5
/
‘Smith’
yes
B-Person 
‘Smith’
1
4
‘May’
‘was’
yes
I-Person
‘May’
5
0
‘in’
/
yes
B-Date

Once we create the training data like the above table, we can then train a standard machine learning method such as SVM, decision tree or logistic regression to generate a model (which contains machine generated rules). We can then apply this model to any new sentence and detect entity types related to "Person" and "Date".

Note that we can only detect entity types contained in our training data. While this may seem a limitation, it allows us to discover new instances or values associated with existing entity types.

The approach we have discussed so far is a supervised learning approach, which depends heavily on human-labeled data. When manual labels are hard to come by, we can find ways to create training data by machine. Such approach is called semi-supervised learning, which holds a lot of promise when data get large.

1 comment: