Harnessing the Expertise of 70,000 Human Editors: Knowledge-Based Feature Generation for Text Categorization

Evgeniy Gabrilovich and Shaul Markovitch. Harnessing the Expertise of 70,000 Human Editors: Knowledge-Based Feature Generation for Text Categorization. Journal of Machine Learning Research, 8:2297-2345, 2007.

Abstract

Most existing methods for text categorization use induction algorithms in conjunction with the bag of words document representation. While they perform well in many categorization tasks, these methods are inherently limited when faced with more complicated tasks where external knowledge is essential. Recently, there have been a number of attempts to augment the bag of words approach with external knowledge, including semi-supervised learning and transfer learning. In this work, we present a new framework for automatic acquisition of world knowledge and methods for incorporating it into the text categorization process. Our approach enhances machine learning algorithms with features generated from domain-specific and common-sense knowledge. This knowledge is represented by ontologies that contain hundreds of thousands of concepts, further enriched by several orders of magnitude through controlled Web crawling. Prior to text categorization, a feature generator analyzes the documents and maps them onto appropriate ontology concepts that augment the bag of words. Feature generation is accomplished through contextual analysis of document text, thus implicitly performing word sense disambiguation. Coupled with the ability to generalize concepts using the ontology, this approach addresses the two main problems of natural language processing---synonymy and polysemy. Categorizing documents with the aid of knowledge-based features leverages information that cannot be deduced from the training documents alone. We applied our methodology using the Open Directory Project---the largest existing Web directory, built by over 70,000 human editors. Experimental results over a range of datasets confirm improved performance compared to the bag of words document representation.
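The feature generation step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the three-concept ontology, the term sets, the overlap-count scoring, and the names `CONCEPT_TERMS`, `ancestors`, `generate_concepts`, and `augment` are all hypothetical stand-ins chosen for illustration. The sketch only shows the general idea of mapping a document onto ontology concepts from its context, generalizing through the hierarchy, and adding the resulting concepts to the bag of words.

```python
from collections import Counter

# Hypothetical toy ontology: each concept path maps to terms that describe it.
# (Assumption: the real system derives concept descriptions from a Web
# directory and crawled pages, not from hand-written sets like these.)
CONCEPT_TERMS = {
    "Science/Biology": {"cell", "gene", "protein", "species"},
    "Business/Banking": {"bank", "loan", "interest", "deposit"},
    "Recreation/Fishing": {"bank", "river", "fish", "rod"},
}


def ancestors(concept):
    """Return the more general concepts above a concept path."""
    parts = concept.split("/")
    return ["/".join(parts[:i]) for i in range(1, len(parts))]


def generate_concepts(tokens, top_k=1):
    """Score concepts by term overlap with the whole document context,
    keep the top_k matches, and add their ancestors (generalization)."""
    counts = Counter(tokens)
    scores = {
        concept: sum(counts[term] for term in terms)
        for concept, terms in CONCEPT_TERMS.items()
    }
    # Rank by descending score; break ties alphabetically for determinism.
    ranked = sorted(
        (c for c in scores if scores[c] > 0),
        key=lambda c: (-scores[c], c),
    )
    selected = []
    for concept in ranked[:top_k]:
        for name in [concept] + ancestors(concept):
            if name not in selected:
                selected.append(name)
    return selected


def augment(tokens, top_k=1):
    """Return bag-of-words counts extended with generated concept features."""
    features = Counter(tokens)
    for concept in generate_concepts(tokens, top_k):
        features["CONCEPT:" + concept] += 1
    return features


finance_doc = "the bank raised the interest on my loan".split()
fishing_doc = "we sat on the river bank with a rod".split()
print(generate_concepts(finance_doc))
print(generate_concepts(fishing_doc))
```

Both toy documents contain the ambiguous word "bank", but the surrounding words push each one toward a different concept, which is the sense in which contextual analysis implicitly disambiguates word senses. The ancestor features ("Business", "Recreation") let documents with different vocabulary share a feature, which is how generalization over the ontology helps with synonymy.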

Keywords: Feature Generation, Text Categorization, Information Retrieval, Explicit Semantic Analysis, ESA

Secondary Keywords: Common-Sense Knowledge, Feature Construction

Online version: http://www.cs.technion.ac.il/~shaulm/papers/pdf/Gabrilovich-Markovitch-JMLR2007.pdf

Bibtex entry:

 @article{Gabrilovich:2007:HEH,
  Author = {Evgeniy Gabrilovich and Shaul Markovitch},
  Title = {Harnessing the Expertise of 70,000 Human Editors: Knowledge-Based Feature Generation for Text Categorization},
  Year = {2007},
  Journal = {Journal of Machine Learning Research},
  Volume = {8},
  Number = {},
  Month = {Oct},
  Pages = {2297--2345},
  Url = {http://www.cs.technion.ac.il/~shaulm/papers/pdf/Gabrilovich-Markovitch-JMLR2007.pdf},
  Keywords = {Feature Generation, Text Categorization, Information Retrieval, Explicit Semantic Analysis, ESA},
  Secondary-keywords = {Common-Sense Knowledge, Feature Construction},
  Abstract = {
    Most existing methods for text categorization use induction
    algorithms in conjunction with the bag of words document
    representation. While they perform well in many categorization
    tasks, these methods are inherently limited when faced with more
    complicated tasks where external knowledge is essential. Recently,
    there have been a number of attempts to augment the bag of words
    approach with external knowledge, including semi-supervised
    learning and transfer learning. In this work, we present a new
    framework for automatic acquisition of world knowledge and methods
    for incorporating it into the text categorization process. Our
    approach enhances machine learning algorithms with features
    generated from domain-specific and common-sense knowledge. This
    knowledge is represented by ontologies that contain hundreds of
    thousands of concepts, further enriched by several orders of
    magnitude through controlled Web crawling. Prior to text
    categorization, a feature generator analyzes the documents and
    maps them onto appropriate ontology concepts that augment the bag
    of words. Feature generation is accomplished through contextual
    analysis of document text, thus implicitly performing word sense
    disambiguation. Coupled with the ability to generalize concepts
    using the ontology, this approach addresses the two main problems
    of natural language processing---synonymy and polysemy.
    Categorizing documents with the aid of knowledge-based features
    leverages information that cannot be deduced from the training
    documents alone. We applied our methodology using the Open
    Directory Project---the largest existing Web directory, built by
    over 70,000 human editors. Experimental results over a range of
    datasets confirm improved performance compared to the bag of words
    document representation.
  }
}