Dec 10, 2012

Introduction to N-grams

In analyzing text documents, we can count the frequency of words appearing together in a fixed order. For example, we can count 2-word phrases like “baseball game”, “baseball card”, and “baseball player”. Similarly, we can count 3-word phrases like “baseball game online”, “baseball game today”, and “baseball game reports”. This approach of counting all sequences of n adjacent words in a document is called the n-gram approach. The 1-grams are the individual words in the document, the 2-grams are the adjacent 2-word phrases, and so on.

For example, in the following sentence:
A Major League Baseball game was held in Salt Lake City 40 years ago.
1-grams are: {A, Major, League, Baseball, game, was, held, in, Salt, Lake, City, 40, years, ago}
2-grams are: {A Major, Major League, League Baseball, Baseball game, game was, was held, held in, ... }
3-grams are: {A Major League, Major League Baseball, League Baseball game, Baseball game was, ... }
Similarly we can count 4-grams, 5-grams and so on. 
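The extraction above can be sketched in a few lines of Python. This is a minimal sketch that assumes simple whitespace tokenization; real tokenizers also handle punctuation and case.

```python
def ngrams(text, n):
    """Return the list of n-grams (as strings) in the given text."""
    words = text.split()
    # Slide a window of n words across the text.
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = "A Major League Baseball game was held in Salt Lake City 40 years ago"
print(ngrams(sentence, 2))
# ['A Major', 'Major League', 'League Baseball', 'Baseball game', ...]
```

Note that a sentence of m words yields m - n + 1 n-grams, so the 14-word sentence above has 13 2-grams.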

Google has an n-gram viewer that counts n-grams in all Google books over selected time periods. The chart at the beginning of this blog shows the counts for the 2-grams “baseball game”, “baseball card”, and “baseball player” in Google books between 1950 and 2008.

The counts of n-grams can be used to predict people’s next search keywords. Suppose we have a collection of all search keywords from the last 2 years. In this collection, the 2-gram “baseball express” has the highest count, followed by “baseball cards”. The 2-gram with the higher count is more likely to appear than the one with the lower count. Thus after a user types “baseball”, the next word is more likely to be “express” than “cards” in the online search scenario. Here is a snapshot of Yahoo! Search suggestion.
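The suggestion idea can be sketched as follows: count every 2-gram in the logged queries, then rank the words that follow the typed word by count. The search-log data here is made up for illustration.

```python
from collections import Counter

# Hypothetical logged search queries (illustrative data only).
search_logs = [
    "baseball express", "baseball express", "baseball express",
    "baseball cards", "baseball cards",
    "baseball game",
]

# Count every 2-gram across the logged queries.
counts = Counter()
for query in search_logs:
    words = query.split()
    for i in range(len(words) - 1):
        counts[(words[i], words[i + 1])] += 1

def suggest(prefix):
    """Suggest next words after `prefix`, most frequent 2-gram first."""
    candidates = [(w2, c) for (w1, w2), c in counts.items() if w1 == prefix]
    return [w2 for w2, c in sorted(candidates, key=lambda x: -x[1])]

print(suggest("baseball"))  # ['express', 'cards', 'game']
```

With these counts, “express” outranks “cards”, matching the scenario described above; a production system would of course use far larger logs and smoothing.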
N-grams are also used in spell checking, where recommended words are determined by the surrounding words, and in speech recognition, where word identification is based on the words uttered before.