Apr 22, 2014

Product attribute extraction

One of the most popular applications of named entity detection is product attribute extraction. Automatically extracting product attributes from text helps e-commerce companies correctly process user queries and match them to the right products. Marketers want to know what people are saying about their products. Corporate strategists want to find out which products are trending based on user discussions.

Product attributes are specific properties of a product. For example, a black Apple iPhone 5 has four major attributes: company name “Apple”, brand name “iPhone”, generation “5”, and color “black”. A complete set of attributes defines a product.

What challenges do we face in extracting product attributes? We can create a dictionary of products and their corresponding attributes, but relying on a dictionary alone has its limitations:

  • A dictionary is never complete, because new products are created all the time.
  • There is ambiguity in matching the attributes. 
  • Misspellings, abbreviations, and acronyms are common, particularly in social media such as Twitter.
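
To make these limitations concrete, here is a minimal sketch of a purely dictionary-based extractor. The dictionary entries and the `extract_attributes` helper are hypothetical and exist only for illustration; a real product dictionary would be far larger.

```python
# Hypothetical attribute dictionary: surface form -> possible attribute types.
ATTRIBUTE_DICT = {
    "apple": ["company"],
    "iphone": ["brand"],
    "black": ["color"],
    "orange": ["color", "company"],  # ambiguous: a color or a company name
}


def extract_attributes(text):
    """Tag each whitespace-separated token by exact dictionary lookup."""
    results = []
    for token in text.lower().split():
        # Tokens missing from the dictionary get an empty list of types.
        results.append((token, ATTRIBUTE_DICT.get(token, [])))
    return results


print(extract_attributes("Apple iPhone black"))  # every token is found
print(extract_attributes("aple ifone blk"))      # misspellings: nothing matches
print(extract_attributes("orange phone"))        # ambiguity: two candidate types
```

Exact lookup works only when the text uses the same spelling as the dictionary and the entry has a single meaning. A product that is not in the dictionary yet is missed in the same way as a misspelled one.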

Let’s look at a case study from eBay (the results were published) and see how they solved this problem. eBay is an online marketplace for sellers and buyers. Small sellers create listings for the items they want to sell on eBay. With more than 100 million listings on the site, sellers offer everything from digital cameras, clothes, and cars to collectibles. To help users quickly find an item they are interested in, eBay’s product search engine has to quickly match a user query to existing listings. It is crucial that eBay group existing listings into products so that the search can be done quickly. This requires automatically extracting product attributes from each listing. However, they face the following challenges:

  • Each listing is less than 55 characters long, so it contains little context.
  • The text is ungrammatical, with many nouns piled together.
  • Misspellings, abbreviations, and acronyms occur frequently.

As we can see, a dictionary-based approach does not work here because of the large volume of new listings and products on the site and the high variation in how the same product attribute is written. For example, the brand “River Island” appears in listings in the following forms:
river islands, river islanfd, river islan, ?river island, riverislandtop. 
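
As a small illustration of why exact matching fails on these variants, the sketch below compares each one to the canonical brand name, first by exact equality and then by a character-level similarity score from Python's standard `difflib` module. This is only meant to show the size of the gap. It is not the method used in the eBay study, and the `similarity` helper is a name chosen here for illustration.

```python
from difflib import SequenceMatcher

BRAND = "river island"
VARIANTS = [
    "river islands",
    "river islanfd",
    "river islan",
    "?river island",
    "riverislandtop",
]


def similarity(a, b):
    """Character-level similarity ratio between 0 and 1, computed by difflib."""
    return SequenceMatcher(None, a, b).ratio()


for variant in VARIANTS:
    exact = variant == BRAND
    score = similarity(variant, BRAND)
    print(f"{variant!r}: exact match={exact}, similarity={score:.2f}")
```

None of the variants matches exactly, even though each is close to the brand name at the character level. A fuzzy threshold can recover some of them, but it also risks matching unrelated words, which is one reason a learned model is attractive here.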

How do we apply machine learning to this problem? In a supervised learning approach, a small set of listings is labeled by humans. In each listing, each word is tagged either with an attribute name or with “Other”. Different product categories have different attributes. For example, in the clothing category there are four major product attributes: brand, garment type, color, and size. The following is an example of a labeled listing:

                    Ann Taylor   Asymmetrical   Knit Dress     NWT   Red     Petite   Size S
                    Brand                       Garment type         Color   Size     Size
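
In code, a labeled listing like the one above can be represented as a sequence of (token, tag) pairs. The sketch below is one possible representation, not the format used in the paper. It assumes that “Asymmetrical” and “NWT”, which have no visible label, are tagged “Other”, and that “Garment type” covers “Knit Dress”. The `group_attributes` helper is hypothetical.

```python
from itertools import groupby

# One labeled listing as (token, tag) pairs.
# Assumption: tokens without a visible label in the example are "Other".
labeled_listing = [
    ("Ann", "Brand"), ("Taylor", "Brand"),
    ("Asymmetrical", "Other"),
    ("Knit", "Garment type"), ("Dress", "Garment type"),
    ("NWT", "Other"),
    ("Red", "Color"),
    ("Petite", "Size"), ("Size", "Size"), ("S", "Size"),
]


def group_attributes(pairs):
    """Merge consecutive tokens that share a tag into one value, skipping "Other"."""
    attributes = {}
    for tag, group in groupby(pairs, key=lambda pair: pair[1]):
        if tag == "Other":
            continue
        value = " ".join(token for token, _ in group)
        attributes.setdefault(tag, []).append(value)
    return attributes


print(group_attributes(labeled_listing))
```

Because adjacent tokens with the same tag are merged, the two size labels in the example collapse into a single value here. A tagging scheme that marks where each attribute begins would keep them apart.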

But supervised learning is very expensive, as it requires a lot of labeled data. Our goal is to use a small amount of labeled data and derive the rest through bootstrapping. This is called semi-supervised learning. For details of our semi-supervised learning approach, please see the paper referenced below.
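
To give a feel for what bootstrapping means, here is a toy sketch of a generic pattern-based bootstrapping loop. It is not the algorithm from the paper. It assumes single-word brands and uses only the word that follows a brand as context. The function name, the parameters, and the listings are all made up for illustration.

```python
from collections import Counter


def bootstrap_brands(listings, seed_brands, rounds=3, min_support=2):
    """Grow a brand list: known brands -> context words -> new brand candidates."""
    brands = {brand.lower() for brand in seed_brands}
    tokenized = [listing.lower().split() for listing in listings]
    for _ in range(rounds):
        # Step 1: count the words that directly follow a known brand.
        contexts = Counter()
        for tokens in tokenized:
            for current, following in zip(tokens, tokens[1:]):
                if current in brands:
                    contexts[following] += 1
        trusted = {word for word, count in contexts.items() if count >= min_support}
        # Step 2: unknown words that precede a trusted context word are candidates.
        candidates = Counter()
        for tokens in tokenized:
            for current, following in zip(tokens, tokens[1:]):
                if following in trusted and current not in brands:
                    candidates[current] += 1
        new_brands = {word for word, count in candidates.items() if count >= min_support}
        if not new_brands:
            break  # nothing new was learned, so stop early
        brands |= new_brands
    return brands


# Toy listings, invented for this example.
listings = [
    "zara dress red s", "zara dress blue m", "zara top black l",
    "mango dress green s", "mango dress pink m", "mango top white l",
    "gap top navy s", "gap top grey m",
]
print(sorted(bootstrap_brands(listings, {"zara"})))
```

Starting from the single seed “zara”, the first round learns that brands tend to precede “dress” and picks up “mango”; the second round adds “top” as a trusted context and picks up “gap”. The same mechanism can also pull in wrong entries (a listing such as “red dress” would make “red” a candidate), so practical systems add stronger filtering before accepting new entries.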

Duangmanee Putthividhya and Junling Hu, "Bootstrapped Named Entity Recognition for Product Attribute Extraction". EMNLP 2011: 1557-1567